1 Introduction


Machine learning is increasingly used in practical applications that can be 
categorized as safety-critical, such as AI-assisted driving. In this context, we
recently considered the problem of driver monitoring, which plays an essential 
part in avoiding accidents by warning the driver in time and shifting the driver’s 
attention to the traffic scenery in critical situations [1]. This applies to
the different levels of automated driving, to take-over requests as well as
to driving in manual mode. More specifically, we tackled the problem of
predicting the driver’s gazing direction. Distinguishing eight different regions, 
this problem can be formalized as a classification task, in which each region 
corresponds to a class (cf. Figure 1). We proposed a deep learning approach to 
predict gaze regions, which is based on informative features such as eye land- 
marks and head pose angles of the driver. Moreover, we introduced different 
post-processing techniques that improve the accuracy by exploiting temporal 
information from videos and the availability of other vehicle signals. Our main 
interest is to leverage accurate gaze prediction for improved human-computer
interaction. In this regard, it is arguably important to guarantee a certain level
of awareness of the computer (AI system) of its own certainty or uncertainty in
a prediction [3].


Figure 1: Spatial zones distinguished in the driver gaze classification task.


In this work, we therefore leverage so-called conformal prediction (CP) to 
increase the reliability of such predictions [10, 12]. Instead of predicting a 
single class, CP produces a set-valued prediction, i.e., a subset of all candidate 
classes that comprises the true class with high probability. This way, the system 
is able to express ambiguities (several regions appear plausible, because the 
driver’s gaze cannot be determined precisely) as well as a partial or complete 
lack of knowledge — for example, the driver may look in a completely different 
direction, which does not correspond to any of the eight pre-specified regions 
(thereby producing so-called out-of-distribution data). 


Conformal prediction can be seen as a meta-learning technique, which can be 
put on top of any base learner, i.e., any standard classifier producing “point 
predictions.” It merely requires a measure of (non-)conformity of a (hypotheti- 
cal) data point, i.e., a measure of how well a combination of feature values and 
gaze directions fits with the training data seen so far. While the required level 
of confidence — the predicted set contains the true class with a pre-specified 
probability (such as 95%) — is guaranteed regardless of the conformity mea- 
sure, the latter has a strong influence on the precision of predictions, i.e., the 
(average) size of the predicted sets. 


In this work, we evaluate different types of conformity scores to construct con- 
formal predictors for driver gaze classification, including scores derived from 
kernel density estimation as proposed in [2, 5], and compare them with regard 
to the quality of set-valued predictions as well as time efficiency. Moreover,
we elaborate on a specific characteristic of our problem, namely the fact that 
our output space exhibits a natural (topological) structure induced by the spa- 
tial relationship between the classes — unlike standard classification problems, 
where the classes constitute a simple set with no relationships between its 
elements. As a consequence, there are more meaningful (viz. topologically 
connected) and less meaningful (unconnected) set-valued predictions. To as- 
sure semantically meaningful set-valued predictions, we propose an extension 
of standard CP. 


The work is structured as follows. In Section 2, we briefly review the gaze
classification system and its results. Section 3 explains the conformal
prediction method in more detail. In Section 4, we apply the method to the gaze
dataset. Section 5 discusses the results, and Section 6 concludes with some
final remarks.


2 Gaze Classification 


The problem of driver monitoring was recently studied in [1]. This section 
briefly summarizes the method used and the key results obtained. For detailed 
information, the interested reader is referred to [1]. 


2.1 Problem Statement and Dataset 


The goal of the gaze classifier is to reliably classify the region the driver is 
looking at, based on an image of the driver. Certain regions are of special 
interest and are displayed in Figure 1. The underlying dataset was extracted 
from a naturalistic driving study in which participants were driving a car for 
several months while being recorded with an RGB camera installed at the A- 
pillar. Sample images are provided in Figure 2. The examined dataset consists 
of 75 video snippets from 20 subjects (5 female, 15 male). Driver videos were
recorded at a resolution of 980 x 540 pixels and 15 frames per second. The
regions of interest (i.e., the classes or labels), together with the number of
images available for each, are given in Table 1.

Table 1: Number of images per class

Class            Abbr.   Number

left shoulder    ls          98
left mirror      lm         967
speedometer      sp         228
front            f        2,296
inner mirror     inm        713
center console   cc         332
right shoulder   rs          72
right mirror     rm         356


Figure 2: Example images from the dataset. 


2.2 Method 


The pipeline of the gaze classification system is depicted in Figure 3. In a 
pre-processing step, meaningful features on the driver’s head pose and the eyes 
are generated from existing image-based methods and then fed into a fully 
connected neural net. For an input image, the driver’s face is detected [4]
and the three head pose angles are computed [8]. The angles describe the 
orientation of the head, where the rotation around the x-axis is called pitch (i.e. 
from up to down), around the y-axis yaw (i.e. from left to right), and around 
the z-axis roll (i.e. from left to right shoulder). For the eyes, the eye landmark 
detector by Park et al. [7] is employed. We make use of 15 landmarks per 
eye that describe the eyelid and the iris. The generated features are fed as 
input to a neural network architecture. The output of this network is scaled by 
the softmax-activation function, which produces a vector with probabilities for 
each class. 
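
The softmax scaling mentioned above can be illustrated with a minimal sketch. The function name `softmax` and the example logits are assumptions made for illustration only; they are not taken from the original system.

```python
import math

def softmax(logits):
    """Map raw network outputs (logits) to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the eight gaze regions (illustrative values only).
example_logits = [0.2, 1.5, -0.3, 3.1, 2.4, 0.0, -1.2, 0.7]
probs = softmax(example_logits)

# Index of the most probable class.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The class with the largest logit also receives the largest probability, so the point prediction is unaffected by the scaling; only the interpretation of the outputs changes.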

Figure 3: Pipeline of the gaze classification system.

Figure 4: Results after cross-validation [1].


2.3 Results 


As there might be driver-dependent characteristics in the features, we propose
training with a leave-one-driver-out cross-validation, i.e., training on 19
drivers while testing on the remaining driver. The results on the test set of
every iteration are aggregated into the confusion matrix given in Figure 4. In
total, the model achieves an accuracy of 87.1%. After aggregating the classes
on the left and on the right side, as well as the speedometer with the front
class, accuracy increases to 91.4%. Misclassification occurs for the classes
front and inner mirror, as well as for inner mirror, front and the right side.
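
The leave-one-driver-out scheme can be sketched as follows. This is a minimal illustration, assuming each frame carries a driver identifier; the function name `leave_one_driver_out` and the toy identifiers are hypothetical.

```python
def leave_one_driver_out(driver_ids):
    """Yield (held_out_driver, train_indices, test_indices), one fold per driver."""
    for held_out in sorted(set(driver_ids)):
        train = [i for i, d in enumerate(driver_ids) if d != held_out]
        test = [i for i, d in enumerate(driver_ids) if d == held_out]
        yield held_out, train, test

# Toy example: six frames recorded from three hypothetical drivers.
driver_ids = ["d1", "d1", "d2", "d3", "d3", "d3"]
folds = list(leave_one_driver_out(driver_ids))
```

Each frame appears in exactly one test fold, so no driver contributes to both training and testing within a fold, and the per-fold test results can be pooled into a single confusion matrix.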


2.4 Discussion 


In general, the error rate of 12.9% can be narrowed down to three types of
misclassifications: (i) misclassifications between similar classes that can be
aggregated without much loss of information (e.g., right shoulder and right
mirror), which make up 4.3%, (ii) misclassifications between classes far apart
(e.g., left mirror and right mirror), which make up 1.4%, and (iii)
misclassifications among classes close to one another. Indeed, there is a high
number of misclassifications for the classes front, inner mirror, right mirror
and center console. One can plausibly assume that classes farther away from the
camera are harder to perceive. If the driver is looking through the front
windshield and directly beneath the inner mirror, e.g., while focusing on a
vehicle far ahead on the right side, it becomes difficult from the camera’s
point of view to correctly annotate this situation, both for the human
annotator and apparently also for the system.


3 Conformal Prediction 


In cases of uncertainty, set-valued predictors are supposed to produce reliable
predictions, i.e., subsets of classes comprising the true one with high
probability, very much like confidence intervals known from classical
statistics. One way of obtaining such sets is through conformal prediction
[10, 12], which is based on the idea of reducing prediction to hypothesis
testing: Given a query instance, a class label is included as a candidate in the
set-valued prediction unless the hypothesis that this label corresponds to the
ground truth can be rejected at a pre-specified level of confidence. The test
itself relies on assigning each instance/label combination a measure of
non-conformity, reflecting how “strange” this combination appears in light of
the data seen so far. Its counterpart is the conformity measure, reflecting the
similarity to the data seen so far.


The original idea of CP was introduced for the setting of online learning [12].
Here, we present a version adapted to the standard setting of supervised
learning, called Inductive Conformal Prediction (ICP) [6]. We focus on the
case of classification, in which we have seen n examples in the training data
and seek to predict the label of a new query instance.


Formally, for previous observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
with $(x_i, y_i) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, a set-valued
predictor $\Gamma^{\epsilon} : \mathcal{X} \to 2^{\mathcal{Y}}$ is constructed on
the basis of a permutation-invariant non-conformity measure
$\phi : \mathcal{Z} \times \mathcal{Z}^{n} \to \mathbb{R}$ that indicates how
strange a hypothetical example $z = (x, y) \in \mathcal{Z}$ is compared to
previous examples in a set $A \in \mathcal{Z}^{n}$. The outputs of the
non-conformity measure are called non-conformity scores and are formally
described as

\[ \phi_i = \phi(z_i, A_i) = \phi\bigl((x_i, y_i), \{z_1, \ldots, z_{n+1}\} \setminus \{z_i\}\bigr) \tag{1} \]
\[ \phantom{\phi_i} = \phi\bigl(y_i, f_i(x_i)\bigr), \tag{2} \]

where $f_i : \mathcal{X} \to \mathcal{Y}$ is the point prediction rule learned on
$A_i$. Then, for a new instance $x_{n+1}$, all possible candidate values
$y \in \mathcal{Y}$ are considered for $y_{n+1}$ by testing the hypothesis
$H_0 : y_{n+1} = y$ (against $H_1 : y_{n+1} \neq y$) and computing the p-values

\[ p_y = \frac{\lvert \{ i = 1, \ldots, n+1 : \phi_i \geq \phi_{n+1} \} \rvert}{n+1}. \tag{3} \]

The set-valued prediction is then given by

\[ \Gamma^{\epsilon}(x_{n+1}) = \{ y : p_y > \epsilon \}. \tag{4} \]

Under some technical assumptions¹, it can be shown that this prediction fulfills

\[ P\bigl( y_{n+1} \in \Gamma^{\epsilon}(x_{n+1}) \bigr) \geq 1 - \epsilon. \tag{5} \]


In ICP, the dataset is split into three parts: the training set, the calibration
set and the testing set. The training set is used to train the point predictor
$f$. The calibration set $\{z_1, \ldots, z_n\}$ is used to compute the
non-conformity scores $\{\phi_1, \ldots, \phi_n\}$ only once. For a new instance
$z_{n+1}$ from the test dataset, the non-conformity score $\phi_{n+1}$ is
computed as usual:

\[ \phi_i = \phi\bigl(z_i, A \setminus \{z_i, z_{n+1}\}\bigr) \quad \forall i \in \{1, \ldots, n\} \tag{6} \]
\[ \phi_{n+1} = \phi\bigl(z_{n+1}, A \setminus \{z_{n+1}\}\bigr). \tag{7} \]

Then, the non-conformity score $\phi_{n+1}$ is compared to the scores from the
calibration set to eventually compute its p-value according to (3).
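
The p-value (3) and the prediction set (4) can be sketched in a few lines once the non-conformity scores are available. This is a minimal illustration rather than the implementation used in the experiments; the function names and the toy scores are assumptions.

```python
def p_value(calibration_scores, test_score):
    """Fraction of scores (calibration plus test) that are >= test_score, cf. (3)."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    # The +1 in the numerator counts the test score itself.
    return (at_least + 1) / (n + 1)

def prediction_set(calibration_scores, candidate_scores, epsilon):
    """Return all labels whose p-value exceeds epsilon, cf. (4).

    candidate_scores maps each candidate label y to the non-conformity
    score of the hypothetical example (x_new, y).
    """
    return {y for y, s in candidate_scores.items()
            if p_value(calibration_scores, s) > epsilon}

# Toy numbers (illustrative only): 9 calibration scores, 3 candidate labels.
calibration = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]
candidates = {"front": 0.15, "inner mirror": 0.55, "left mirror": 0.95}
region_set = prediction_set(calibration, candidates, epsilon=0.2)
```

A label is dropped only if its hypothetical example looks stranger than all but a fraction of at most epsilon of the calibration examples, which is exactly the hypothesis test described above.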


¹ A key assumption is the condition of exchangeability [12].


Table 2: Non-Conformity Measures

A) Kernel Density (KDE)          $\phi((x, \hat{y}), A) = \bigl(1 + \hat{p}_A(x \mid y = \hat{y})\bigr)^{-1}$
B) Distance to mean (DTM)        $\phi((x, \hat{y}), A) = \lVert \bar{x}_{A,\hat{y}} - x \rVert$
C) 1 Nearest Neighbour (1NN)     $\phi((x, \hat{y}), A) = \lVert x_{A,\hat{y}} - x \rVert$


The guarantee (5) holds “on average”, that is, when assuming new samples
$(x_{n+1}, y_{n+1})$ to be drawn according to the underlying probability measure
on $\mathcal{Z}$. It does not hold, however, conditional on a specific class
$\hat{y} \in \mathcal{Y}$, i.e., under the condition that $y_{n+1} = \hat{y}$. In
other words, predictions might be more valid for some (ground truth) classes and
less for others. Therefore, in cases of strong class imbalance, where the set
sizes vary too strongly among the classes for the same choice of confidence
level $1 - \epsilon$, it appears meaningful to choose $\epsilon_{\hat{y}}$ for
each class $\hat{y}$ separately. This is also known as Mondrian Conformal
Prediction [11].
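
A class-conditional (Mondrian) variant can be sketched as follows. This is one common formulation, which may differ in detail from the variant intended here: each candidate label is compared only with calibration scores of the same label, and each label has its own significance level. All names and numbers are illustrative assumptions.

```python
def mondrian_p_value(calibration, candidate_label, candidate_score):
    """Class-conditional p-value.

    calibration is a list of (label, score) pairs; only the scores of
    examples carrying candidate_label enter the comparison.
    """
    same_class = [s for lbl, s in calibration if lbl == candidate_label]
    at_least = sum(1 for s in same_class if s >= candidate_score)
    return (at_least + 1) / (len(same_class) + 1)

def mondrian_prediction_set(calibration, candidate_scores, epsilon_per_class):
    """Keep label y if its class-conditional p-value exceeds its own epsilon."""
    return {y for y, s in candidate_scores.items()
            if mondrian_p_value(calibration, y, s) > epsilon_per_class[y]}

# Toy calibration data (illustrative only).
calibration = [("front", 0.1), ("front", 0.2), ("front", 0.3), ("front", 0.4),
               ("left shoulder", 0.5), ("left shoulder", 0.6),
               ("left shoulder", 0.7), ("left shoulder", 0.9)]
candidates = {"front": 0.35, "left shoulder": 0.95}
eps = {"front": 0.1, "left shoulder": 0.25}
result = mondrian_prediction_set(calibration, candidates, eps)
```

Because a rare class has few calibration examples, its p-values are coarse; this is one practical reason for choosing the significance level per class.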

In the following, we consider several measures for $y_{n+1} = \hat{y}$, which
are given in Table 2. There, $\bar{x}_{A,\hat{y}}$ is the mean over all
x-vectors in $A$ labeled with class $\hat{y}$, and $x_{A,\hat{y}}$ the instance
in $A$ with smallest (Euclidean) distance to $x$ among those with label
$\hat{y}$. Moreover, $\hat{p}_A$ denotes the class-conditional density,
estimated on the set $A$ by means of kernel density estimation with a Gaussian
kernel.
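
The three measures can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the set A is a list of (feature vector, label) pairs, the Gaussian kernel is isotropic, and the `bandwidth` parameter is hypothetical, since the text does not state how the kernel is parameterized.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def points_with_label(A, label):
    """Return the feature vectors in A that carry the given label."""
    return [x for x, y in A if y == label]

def dtm_score(x, label, A):
    """Distance from x to the mean of all vectors labeled `label`."""
    pts = points_with_label(A, label)
    mean = [sum(col) / len(pts) for col in zip(*pts)]
    return euclidean(x, mean)

def nn_score(x, label, A):
    """Distance from x to its nearest neighbour among vectors labeled `label`."""
    return min(euclidean(x, p) for p in points_with_label(A, label))

def kde_score(x, label, A, bandwidth=1.0):
    """1 / (1 + class-conditional Gaussian kernel density estimate at x)."""
    pts = points_with_label(A, label)
    norm = (2 * math.pi * bandwidth ** 2) ** (len(x) / 2)
    density = sum(math.exp(-euclidean(x, p) ** 2 / (2 * bandwidth ** 2))
                  for p in pts) / (len(pts) * norm)
    return 1.0 / (1.0 + density)

# Toy data (illustrative only): two "front" points and one "left mirror" point.
A = [((0.0, 0.0), "front"), ((2.0, 0.0), "front"), ((5.0, 5.0), "left mirror")]
x = (1.0, 0.0)
```

For the query `x`, all three scores are smaller for "front" than for "left mirror", i.e., the hypothesis "front" conforms better to the data. The distance to the mean uses only one summary point per class, whereas the other two measures depend on the individual instances.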


4 Results 


In this section, the results for the different non-conformity measures are
reported for the gaze dataset introduced earlier. For every new instance in the
test set, the p-value $p_y$ for each class $y \in \mathcal{Y}$ is returned. The
latter can be used in two different ways: (i) The class with the highest p-value
is chosen as a point prediction (the p-value itself is then called the
credibility of the prediction). (ii) For a given confidence level
$1 - \epsilon$, the set of labels $\Gamma^{\epsilon}$ is returned as a
set-valued prediction. For (i), the error of this predictor is reported, while
for (ii), the average size of the predicted sets (at different confidence
levels) is of specific interest.
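
Both uses of the p-values can be sketched as follows. The function names and the example p-values are hypothetical and serve only to illustrate the two modes and the set-size statistic.

```python
def point_prediction(p_values):
    """Return (label, credibility) for the label with the highest p-value."""
    label = max(p_values, key=p_values.get)
    return label, p_values[label]

def set_prediction(p_values, epsilon):
    """Return all labels whose p-value exceeds epsilon."""
    return {label for label, p in p_values.items() if p > epsilon}

def average_nonempty_set_size(prediction_sets):
    """Mean size over the non-empty prediction sets (0.0 if there are none)."""
    sizes = [len(s) for s in prediction_sets if s]
    return sum(sizes) / len(sizes) if sizes else 0.0

# Hypothetical p-values for one test frame (illustrative numbers only).
p_values = {"front": 0.62, "inner mirror": 0.31, "right mirror": 0.04}
label, credibility = point_prediction(p_values)
region_set = set_prediction(p_values, epsilon=0.05)
```

In this toy case the point prediction is "front" with credibility 0.62, while the set-valued prediction at confidence level 0.95 additionally keeps "inner mirror".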
4.1 The Gaze Dataset


Similar to [6, 2], we apply the method of conformal prediction by extracting 
the output of the neural network before applying the softmax function and use 
it as the input feature for ICP (cf. Figure 3). In this way, we circumvent the 
disadvantages of the softmax transformation [2]. For calibration, 300 instances 
per class (100 instances for the classes left shoulder and right shoulder) are 
sampled. The test set is a newly annotated dataset that consists of 5 videos 
with 3,138 frames. 


We report the results for the three conformity measures KDE, DTM, and 1NN.
A stacked bar plot is employed to visualize the set sizes for each conformity
measure. It is provided in Figure 5. The set sizes at confidence level
$1 - \epsilon$ are displayed in different colors. The black graph corresponds to
the accuracy of the single predictions, while the red graph represents the
accuracy of all predicted sets (including empty sets as well). More information
is provided in Table 3, which lists the average credibility of the class with
the highest p-value. Furthermore, the table contains information on the average
size of the non-empty sets and the accuracies for all sets at the confidence
levels $1 - \epsilon \in \{0.85, 0.90, 0.95, 0.98\}$.


From Table 3, it can be observed that the error of the point predictor is lowest at
10.6% for the KDE measure. The other measures produce error rates between
16.1% and 18.3%, while the error rate for the baseline gaze classifier with the
softmax-generated output is at 15.2%. The favorable, i.e., the highest average
credibility is reached by KDE and 1NN at 32.19% and 34.14%. From the
plots in Figure 5, it can be noticed that there are only slight differences in the
number of empty set predictions (colored in green) and single set predictions
(in purple). The number of sets with more than one label is highest for all
confidence levels for the measure 1NN. Table 3 shows that the average size
of non-empty sets is always lowest for KDE. Also, the average set size for
non-empty sets is similar for all three measures at lower confidence levels,
e.g., $1 - \epsilon = 0.85$. With increasing confidence levels, the sizes vary
more strongly, e.g., 4.81 for DTM and 1.69 for KDE at confidence level
$1 - \epsilon = 0.98$. The (statistical) guarantee (5) is met for both 1NN and
KDE. DTM misses the required confidence level by a few percentage points at
higher confidence levels.


Figure 5: Stacked bar graph that visualizes the set sizes. Panels: (A) KDE,
(B) DTM, (C) 1NN. Legend: set sizes, accuracy (all sets), accuracy (single sets).


The average computational time per method, i.e., the time for computing the
conformity scores for the calibration set and the testing set, is shortest for
DTM with 64 seconds. KDE and 1NN need 379 and 723 seconds on average,
respectively.


4.2 Structure of Prediction Sets 


As the label space $\mathcal{Y}$ consists of eight regions in our application,
there are $2^8 = 256$ possible prediction sets that might be produced by CP.
Even if all these sets are valid in a statistical sense, not all of them appear
to be semantically meaningful. In fact, $\mathcal{Y}$ is not just a set of
distinct classes. Instead, the classes are spatially related to each other.
Intuitively, one would therefore expect that prediction sets correspond to
spatially neighboring regions. Or, stated differently, prediction sets that
include certain regions while omitting regions “in-between” may appear less
meaningful. For example, if right mirror and inner mirror are included, one
would expect front to be included, too.
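
Whether a prediction set is spatially connected can be checked with a simple graph search. The adjacency relation below is a hypothetical assumption chosen for illustration; the actual neighborhood structure follows from the zone layout in Figure 1 and need not coincide with it. The check itself is not the extension of CP proposed in this work, only a diagnostic.

```python
from collections import deque

# Hypothetical adjacency between gaze regions (an assumption for illustration).
ADJACENT = {
    "lsh": {"lm"},
    "lm": {"lsh", "fr"},
    "fr": {"lm", "sp", "inm", "cc", "rm"},
    "sp": {"fr", "cc"},
    "inm": {"fr"},
    "cc": {"fr", "sp", "rm"},
    "rm": {"fr", "cc", "rsh"},
    "rsh": {"rm"},
}

def is_connected(region_set, adjacency=ADJACENT):
    """True if the regions form one connected block in the adjacency graph."""
    regions = set(region_set)
    if len(regions) <= 1:
        return True
    start = next(iter(regions))
    seen = {start}
    queue = deque([start])
    # Breadth-first search restricted to the regions in the set.
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current] & regions:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == regions

# Number of possible prediction sets over eight labels.
n_possible_sets = 2 ** len(ADJACENT)
```

Under this assumed adjacency, `is_connected({"rm", "inm"})` is false, while adding the in-between region, `is_connected({"rm", "inm", "fr"})`, makes the set connected.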


To check for the semantic meaningfulness of CP predictions, we examine the
test data further. For the confidence level $1 - \epsilon = 0.96$, the predicted
sets and their number are displayed in Table 4. The sets {right mirror, inner
mirror} and {inner mirror, center console} are examples of arguably less
meaningful predictions, as they omit the in-between class front. These sets are
marked with (*). Only once does the predicted set contain classes whose
combination is evidently not meaningful, namely {right mirror, left mirror,
front}, marked with (**). This combination was produced by 1NN. In total, 48 of
the 256 theoretically possible sets are predicted.


Table 3: Results Gaze Dataset

Conf-Meas.        KDE      DTM      1NN

Error*            0.106    0.161    0.17
Time (s)          379      64       723
Avg. Cred.        32.19    39.68    34.14

85    Size        1.2      1.28     1.4
      Acc.        0.853    0.849    0.895

90    Size        1.34     1.73     1.44
      Acc.        0.932    0.878    0.931

95    Size        1.56     2.38     1.59
      Acc.        0.947    0.897    0.962

98    Size        1.69     4.81     2.06
      Acc.        0.953    0.965    0.988

*Error of the Baseline model: 0.152.


5 Discussion 


While the statistical guarantee of correctness holds with an increasing number 
of instances, regardless of the non-conformity measure chosen, this measure 
has an important influence on the efficiency of CP, that is, the size of prediction 
sets: The more suitably the non-conformity measure is chosen, the smaller 
these sets will be. In our case, non-conformity scores derived from KDE and 
1NN provide significantly smaller sets than DTM, which is in line with the
theory presented in [9]. Indeed, one should note that the distance to the mean 
is a rather crude measure, which ignores a lot of information about the class 
distributions. 


Table 4: Sets at confidence level $1 - \epsilon = 0.96$

Set size   1NN    DTM    KDE    Sets

0          0      0      8      {}

1          2307   26     1718   {rm}, {lsh}, {sp}, {inm}, {lm}, {cc}, {fr}

2          344    1007   1009   {rm, cc}, {rm, inm}*, {rm, fr}*, {lsh, fr}*, {fr, cc}, {lm, lsh}, {lm, fr},
                                {cc, sp}, {inm, sp}*, {inm, cc}*, {fr, sp}, {inm, fr}

3          120    591    397    {lm, cc, sp}, {rsh, rm, cc}, {rsh, inm, fr}*, {rm, lm, fr}**, {rm, inm, fr}*,
                                {rm, inm, cc}*, {inm, fr, cc}, {inm, cc, sp}*, {inm, fr, sp}, {lm, fr, sp},
                                {rm, fr, sp}*, {inm, lm, fr}, {lm, lsh, fr}, {fr, cc, sp}

4          53     1445   6      {inm, fr, cc, sp}, {rm, lm, fr, sp}*, {lm, lsh, fr, sp}, {lm, fr, cc, sp},
                                {rsh, rm, fr, cc}, {rsh, rm, inm, cc}, {rm, inm, fr, sp}*, {rm, inm, fr, cc}

5          312    60     0      {inm, lm, fr, cc, sp}, {rm, inm, fr, cc, sp}, {rsh, rm, inm, fr, cc},
                                {rsh, rm, inm, lm, fr}*

6          2      9      0      {rm, inm, lm, fr, cc, sp}, {rsh, rm, inm, fr, cc, sp}

Number of predicted sets: 48.


As for the semantic meaningfulness of the CP predictions, we observed that CP
seems to capture the spatial structure of the classes quite well, with only a few
exceptions. In total, 48 of the 256 possible sets were returned as predictions,
only 13 of which displayed minor gaps and a single one a severe gap in
the associated spatial region, i.e., the union of the regions associated with the
classes in the set.


CP can also be used as a point predictor. In that sense, regarding solely the error 
made when choosing the class with the highest p-value, CP with the KDE mea- 
sure as non-conformity score even outperforms the original gaze classification 
system, while also providing additional information. This indicates that the 
relations among the neural net’s outputs cannot solely be disclosed with the 
softmax function, and that the highest value in the output layer does not always 
correspond to the best prediction. 


6 Conclusion 


In safety-relevant applications of machine learning, such as AI-assisted
driving, a predictive model should produce reliable predictions and be aware of
its own uncertainty. In this paper, we considered the problem of predicting the
driver’s gazing direction and elaborated on the use of conformal prediction to
represent uncertainty. Instead of guessing a single class label (gazing direction),
even in cases of uncertainty, conformal prediction yields set-valued predictions 
that are guaranteed to cover the true class with high probability. Our first 
experimental results with different variants of conformal prediction are rather 
promising. In particular, we have seen that the extension of our original gaze 
classification system by means of CP can indeed decrease the error rate of 
the model while still providing important information on the confidence of 
the estimates. Especially promising is the non-conformity measure based on 
kernel density estimation, as it yields the smallest set sizes at high confidence 
levels. 


A possible application of this method, which we seek to investigate in future
work, is the handling of out-of-distribution data for classes not covered by the
gaze classification system (e.g., blinks). Moreover, we plan to elaborate on
Mondrian Conformal Prediction with class-specific confidence levels, where
the confidence levels are determined by the Pareto optimum between different
criteria (e.g., average set size and accuracy of single predictions).


Abstract


In material sciences, X-ray diffraction (XRD) and nuclear magnetic resonance
(NMR) are methods that generate one-dimensional signals describing intensities
over an angle or a chemical shift. Each material has a characteristic profile,
and unknown samples are typically matched to known references. Automatic
classification of one-dimensional signal patterns is a non-trivial task due to
background noise and varying positions of measured intensities in identical
probes. Convolutional Neural Networks prove to be particularly suitable; a
limitation, though, is that adding new classes requires retraining, whereas the
continuous discovery of new materials calls for easy class extension. Siamese
Neural Networks are able to extend data set classes easily
and are popular in the field of face recognition, where new faces are constantly 
added to the database of references. In this paper, we apply Siamese networks 
to one-dimensional XRD-data for the first time and discuss the opportunities 
and challenges as well as areas of application. We show that Siamese networks 
are well suited for the transfer between XRD datasets, achieving an accuracy 
of 99% for materials not present in the training dataset. 

Figure 1: Layout and functionality of an instrument for measuring XRD scans and a resulting
diffraction pattern 


1 Introduction 


A typical task in the field of material analysis is to analyze an unknown sample 
in order to determine the underlying properties or to assign the sample to a 
known material. For this purpose, a variety of destructive and non-destructive 
methods exist, which scan the samples with a measuring instrument and then 
output a one- to multi-dimensional signal, which is subsequently analyzed 
using software or expert knowledge. Because a large number of samples has to
be analyzed and experts are needed to interpret the measurement signals,
automated approaches are particularly well suited to speed up the analysis of
unknown material samples in this field.


One representative of the methods for the analysis of crystalline samples
is X-ray diffraction (XRD), which utilizes Bragg’s Law to deduce the
underlying crystal structure of the sample and thus to determine the
corresponding mineral. In the most prevalent technique for measuring XRD signals,
called the Bragg-Brentano focusing geometry, the crystalline sample is crushed
into a powder and measured using a moving pair of emitter and detector.
Figure 1a shows the layout of an instrument that employs the Bragg-Brentano
geometry. The powdered sample is placed in the center, and the X-ray emitter
as well as the detector move upwards on a circular orbit with radius R.
The detector measures the intensity of diffracted X-rays subject to the incident
angle θ. Figure 1b visualizes the resulting one-dimensional XRD scan with
the measured intensity (count) as a function of the emergent angle, typically
described by 2θ [1].
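
Bragg’s Law relates the spacing d of the atomic planes to the angle at which a peak appears: n·λ = 2·d·sin(θ), with wavelength λ and diffraction order n. A minimal sketch of this relation is given below; the wavelength of 1.5406 Å (Cu Kα radiation, a common laboratory source) and the lattice spacing are assumptions for illustration and are not taken from the text.

```python
import math

def two_theta_deg(d_spacing, wavelength, order=1):
    """Peak position 2*theta in degrees from Bragg's law n*lambda = 2*d*sin(theta)."""
    s = order * wavelength / (2.0 * d_spacing)
    if not 0.0 < s <= 1.0:
        # sin(theta) must lie in (0, 1]; otherwise no peak can occur.
        raise ValueError("no diffraction peak for this combination")
    return 2.0 * math.degrees(math.asin(s))

# Assumed values: Cu K-alpha wavelength and a hypothetical plane spacing (both in angstrom).
CU_K_ALPHA = 1.5406
peak_position = two_theta_deg(d_spacing=3.1355, wavelength=CU_K_ALPHA)
```

Since every set of lattice planes has its own spacing, each one contributes a peak at its own 2θ position, which is why the peak locations characterize the crystal lattice.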


For the XRD scan analysis, each crystalline material forms a distinct diffraction 
pattern which is subject to the underlying crystal lattice properties. It is of
interest where peaks are located and how the intensities are distributed. The
X-rays are reflected by atomic planes, so the number and location of peaks
depend on the form and shape of the lattice, while the intensity depends on the
constituent atoms. Accordingly, the XRD scans are analyzed by comparing the
diffraction pattern to a database of reference minerals. Natural influences
cause small deviations in peak position, shape and intensities, so that the
patterns are not exactly the same [1]. Thus, matching measured and reference
patterns is
not a trivial task, so professional expertise is required for the manual evaluation 
of XRD scans. 


Similarly, other methods for analyzing unknown material samples, like the 
proton nuclear magnetic resonance spectroscopy (¹H NMR) or infrared
spectroscopy, also feature a one-dimensional signal that is characterized by peak
locations, shape and intensities. Consequently, a variety of algorithms for the 
automated analysis have been developed, with a focus on machine and deep 
learning models in recent years, which gained great popularity especially in 
the field of image analysis [2, 7, 4]. In many applications, neural networks 
are employed as classifiers, which have a clear disadvantage for the use with 
material science data: Once trained for a certain dataset, no further materials 
can be added without the network (or at least some of its layers) having to be 
retrained. 


A special, extendable type of neural network is the so-called Siamese network,
which does not learn any classification rules but distinguishes representatives
of one class from those of other classes by means of a distance function.
To this end, the network derives the features of the two inputs to be compared
in two identical, parallel processing lines, allowing the references to be
constantly changed and extended, as long as the type of the calculated features
remains the same. Typically, this type of neural network is used in the field of
face recognition, where the inputs in the form of faces are always somewhat
similar, but the network extracts characteristic features to distinguish
dissimilar from similar faces. Likewise, one-dimensional data such as XRD or NMR
scans, which consist of characteristic peaks at specific locations, are
potentially suitable for use with Siamese networks.


Accordingly, we apply Siamese neural networks to one-dimensional data for the
first time, using XRD data as an example. In contrast to regular neural network
classifiers, which permanently learn the classification rules in the training
phase and are therefore not extendable, Siamese networks compare the scan to be
evaluated with a dynamically composed set of references. Hence, we


1. train a classifier and a Siamese network on one dataset to compare the
baseline performance,


2. apply the trained Siamese network to a second dataset without retraining,


3. retrain the classifier for the second dataset using transfer learning.


2 Related Work 


The XRD scans of distinct phases differ in the number of peaks and the corre- 
sponding positions, as demonstrated in Figure 2a. Between diffraction patterns 
of the identical material, natural influences cause small deviations of peak 
position, shape and intensities, which is visualized in Figure 2b. For example, 
stress or small deviations of the lattice lead to a shift of the peaks, varying
crystallite sizes cause wider, flatter peaks, and non-ideal preparation of the
samples leads to changes of the intensity ratios [1]. The classifiers receive the
XRD scans as input and must learn to distinguish between variations within a 
phase and diffraction patterns of different materials. 


Naturally occurring crystalline materials, such as ores, are usually not pure
minerals, but rather contain one major phase and small amounts of minor
phases. Thus, numerous works employed the Non-Negative Matrix Factorization
(NMF) algorithm, which tries to describe the measured signal by combining
several components with different proportions [2]. Through a supervised training
procedure, the model learns to identify the components and, depending on the
respective fractions, the input sample is assigned to the most similar mineral
of the training set. The NMF approach, however, was developed for specific,
small applications like the ternary Al-Li-Fe oxide system with 6 phases to be
distinguished [2]. It is unclear whether the performance can be transferred to
larger datasets, since in the field of material sciences hundreds or thousands of
materials frequently have to be distinguished from each other.


Figure 2: Differences between XRD scans of distinct materials and samples of identical
phases. The XRD scans serve as an input for the classifiers.


Most recently, neural networks gained widespread popularity for use with 
image data and were also applied to XRD data. First, Park et al. [5] and Oviedo 
et al. [6] showed that convolutional neural networks work well for predicting 
crystalline properties like space group and crystal system. Then, Wang et
al. [7] demonstrated that a neural network with a VGG16-like architecture [7]
is able to distinguish between 1012 metallic phases. Similarly, convolutional
neural networks proved to be reliable for the prediction of unknown
one-dimensional NMR signals [4].


In application, the presented networks and algorithms are able to assign an
unknown sample to a known reference material. However, this requires that the
corresponding material is also present in the training dataset; otherwise,
the classifier is not able to provide a meaningful result. For the XRD case with
over 1000 learned metallic phases, there is a high probability that an unknown
metallic sample was also present in the training dataset, but a typical
diffraction pattern database contains more than 10000 different phases.
It is unclear whether it is possible to train a single classifier for all
possible materials while still identifying minor differences between similar
phases.


Furthermore, in the field of materials science it is common to identify new
materials, which would require an extension of the existing classifiers. The
presented neural networks [5, 6, 7] are not easily extensible: the classifier
has to be trained again in order to add one class to the existing ones. Instead of
retraining all of the layers, it is also possible to freeze the weights of the
convolutional layers and only retrain the fully connected layers before the
output, a method that is usually referred to as transfer learning [8].


Alternatively, a concept called Siamese neural networks offers the functionality
to extend or limit the classes during the prediction process. Originally developed
for signature verification [9], Siamese neural networks feature two inputs with
parallel processing operations that share their weights. Accordingly, the network
rates the similarity of the two processed inputs. For a classification task, one
input is the unknown sample to be analyzed and the second one a reference
material. After comparing the sample to all references, the network assigns
the class by choosing the highest similarity to a known material. Consequently,
the network does not require retraining to add another class; instead, an
additional reference input to be compared is added.
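

The classification step described above can be illustrated with a minimal
sketch. It assumes that the embeddings have already been computed by a trained
network; the function names, the material labels and the three-dimensional
vectors are hypothetical and only serve as placeholders. Similarity is
expressed here as a small Euclidean distance.

```python
import math


def euclidean(v, w):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))


def classify(sample_embedding, references):
    """Return the label whose reference embedding is closest to the sample.

    `references` maps a class label to exactly one reference embedding.
    """
    return min(references, key=lambda label: euclidean(sample_embedding, references[label]))


# Hypothetical embeddings; a real Siamese network would produce these vectors.
references = {"quartz": [1.0, 0.0, 0.2], "calcite": [0.0, 1.0, 0.1]}
predicted = classify([0.9, 0.1, 0.25], references)

# Adding a class requires no retraining, only one more reference entry.
references["hematite"] = [0.0, 0.1, 1.0]
predicted_after = classify([0.1, 0.0, 0.9], references)
```

Restricting the set of candidate classes works the same way: entries are
simply removed from the reference dictionary before the comparison.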


The Siamese network approach is particularly popular in the field of face
recognition, where the reference database is continuously supplemented with new
faces [10]. Here, the Siamese network uses its convolutional layer structure to
extract characteristic features from the 2D inputs to find similarities and
distinguish between different faces. In the field of materials science, Zhang et al. [11]
used Siamese networks with 2D NMR data to distinguish between
material classes. Two-dimensional NMR data, however, is more time-consuming
to measure, thus an expert commonly evaluates the one-dimensional data manually
instead. Accordingly, we employ Siamese networks for
the first time to classify one-dimensional data describing measured materials,
for which currently only non-extendable classifiers exist.


Figure 3: Architecture for VGG16-like classification network with convolutional layer (Conv)
stacks, max pooling layers (Pooling) and three fully-connected layers (Dense) before
the output


3 Methods 


3.1 Siamese Networks 


We want to evaluate whether Siamese networks are suitable for use with
one-dimensional data, using XRD data as an example. To assess the performance
of the Siamese network, we use the classification network developed by Wang
et al. [7] as a reference, as presented in Figure 3. Here, we use diffraction
patterns with 3250 datapoints (measured from 5° to 70° with Δ2θ of 0.02°) as
an input for the network and employ an architecture of convolutional layers,
max pooling operations and dropout before we reshape the features into a
one-dimensional embedding vector in the flatten layer. For the convolutional
operations, we use a 5x1 kernel with 6 to 64 filters from the first to the last
convolutional layer, apply max pooling with a pool size of 2 and stride 2 and
use a dropout rate of 0.2 between the pooling and convolutional layers of the
different stages to reduce overfitting during training. After the reshaping
of the features into an embedding vector, the three fully-connected (dense)
layers learn the classification rules before the network outputs the prediction
scores of the classes in the output layer. Therefore, the number of neurons in
the output layer corresponds to the number of eligible classes.


Figure 4: Architecture for VGG16-like classification network


Due to the dense layers, the classification network is not transferable to another 
set of classes without retraining. One concept to reduce the required retraining 
time is transfer learning where we freeze the weights of the convolutional 
layers, so they cannot be changed during the training process. 


Accordingly, we try two different tactics to retrain the classification network 
for different classes: 


1. Full retraining of all layers, and 


2. transfer learning approach, where we freeze the weights of the 
convolutional layers and only retrain the dense layers. 


Instead of learning the classification rules in the fully-connected layers, the
Siamese network uses the embedding vectors of the extracted features to rate
the similarity of two input patterns, as visualized in Figure 4. The Siamese
network employs two parallel processing pipelines to transform the input XRD
scans into comparable features by sharing the weights between the corresponding
layers. Thus, both inputs have to be of the same kind: the network is not
able to compare an XRD scan with a crystalline properties vector.


During the training process the Siamese network has to fine-tune the weights to
decrease the distance between embeddings of the same class, while increasing
the distance between classes. The distance between two vectors v and w is
usually calculated using the Euclidean distance


$$ d(v, w) = \lVert v - w \rVert_2 = \sqrt{\sum_i (v_i - w_i)^2}. \qquad (1) $$


Thus, the network can be trained by feeding positive and negative examples for
each sample and class and adapting the weights to minimize or maximize the
distances accordingly. One loss function that combines this functionality into
a single function to be optimized is the triplet loss [10]


$$ L(A, P, N) = \max\left(\lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \alpha,\, 0\right). \qquad (2) $$


Here, the embedding of the input sample f(A) (anchor) is compared to the
embeddings of a positive sample f(P) of the same class and a negative one
f(N). The difference between the distances has to be at least equal to the
margin α for the loss to be zero, otherwise it is positive. Therefore, the Siamese
network is trained by minimizing the triplet loss function.
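

As a small numerical illustration of Eq. (2), the following sketch evaluates
the triplet loss for hand-picked embeddings. The margin value of 0.2 and the
two-dimensional vectors are assumptions chosen for the example; they are not
the settings used in our experiments.

```python
def squared_distance(v, w):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(v, w))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss of Eq. (2); `margin` plays the role of alpha (assumed value)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]

# A negative far away from the anchor already satisfies the margin.
loss_easy = triplet_loss(anchor, positive, [1.0, 0.0])

# A negative close to the anchor violates the margin and yields a positive loss.
loss_hard = triplet_loss(anchor, positive, [0.2, 0.0])
```

Only triplets of the second kind contribute a gradient during training, since
the loss is clipped to zero as soon as the margin is satisfied.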


3.2 Training Data 


Measured and labelled data is not available in the required amount since deep
learning algorithms require several hundred training examples. Thus, we rely
on synthetic XRD data, which we simulate based on a database of measured
crystallite properties and physical principles, perfectly imitating measured
diffraction data [12]. By simulating the XRD scans, we ensure that the
training data contains the relevant deviations of diffraction peaks demonstrated
in Figure 2b and every class is represented with the same number of scans.
Accordingly, we assemble two datasets:


For the first dataset A, we select 100 random crystallite materials from a
reference database (which contains 345 materials in total) and simulate 50
variations for each, resulting in 5000 total synthetic XRD scans. The training,
validation and test sets are formed with a 50-20-30 split, so for every class 25
variations are used for the training set, 10 for the validation set and 15 for the
test set. In total, dataset A contains 2500 samples in the training set and 1000
and 1500 samples in the validation and test set, respectively. Accordingly, we
ensure that no class is under-represented in any of the sets.
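

The per-class split can be sketched as follows. The scan identifiers are
placeholders for the simulated patterns, and the function name, the fixed seed
and the shuffling step are assumptions of this illustration rather than a
description of our actual tooling.

```python
import random


def split_per_class(scans_by_class, n_train=25, n_val=10, n_test=15, seed=0):
    """Split every class separately so that all classes are equally represented."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, scans in scans_by_class.items():
        if len(scans) != n_train + n_val + n_test:
            raise ValueError(f"class {label} has an unexpected number of scans")
        shuffled = list(scans)
        rng.shuffle(shuffled)
        train += [(scan, label) for scan in shuffled[:n_train]]
        val += [(scan, label) for scan in shuffled[n_train:n_train + n_val]]
        test += [(scan, label) for scan in shuffled[n_train + n_val:]]
    return train, val, test


# Placeholder data: 100 classes with 50 simulated variations each.
scans_by_class = {c: [f"class{c}_var{i}" for i in range(50)] for c in range(100)}
train, val, test = split_per_class(scans_by_class)
```

Splitting inside each class, instead of splitting the pooled 5000 scans,
is what guarantees the 25/10/15 counts for every material.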


Secondly, we set up dataset B in the same way. We select another 100 random 
crystallite materials without any overlap between the classes of datasets A and 
B. Again, we split the 5000 synthetic XRD scans into training, validation and 
test data using the 50-20-30 split. 


Using datasets A and B, we evaluate the performance of a classifier and Siamese
networks, as well as transfer learning and the transferability of the Siamese
network. Hence, we train the classifier and Siamese network with the
training set of dataset A and choose the weights that perform best on the validation
set to avoid overfitting. Afterwards, we assess the performance of the VGG16-like
classifier and the corresponding Siamese network for dataset A using the test
set. For the Siamese network, we pick a random scan of the training or validation
set as the reference input for each class to calculate the distances for the test
samples. As performance metrics, we choose the Top-1 and Top-3 accuracy,
based on the highest prediction scores for the classifier and the smallest distances for
the Siamese network.
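

To make the metric concrete, the following sketch computes the Top-k accuracy
from a table of distances between test samples and class references. The
distance values and class labels are invented for illustration, and the
function name is hypothetical.

```python
def top_k_accuracy(distance_rows, true_labels, class_labels, k):
    """Fraction of samples whose true class is among the k smallest distances."""
    hits = 0
    for distances, truth in zip(distance_rows, true_labels):
        ranked = sorted(range(len(class_labels)), key=lambda j: distances[j])
        top_k = [class_labels[j] for j in ranked[:k]]
        hits += truth in top_k
    return hits / len(true_labels)


class_labels = ["A", "B", "C", "D"]
# One row per test sample, one distance per class reference (invented values).
distance_rows = [
    [0.1, 0.5, 0.9, 0.7],  # closest reference: A
    [0.4, 0.3, 0.2, 0.9],  # ranking: C, B, A, D
    [0.8, 0.6, 0.7, 0.1],  # ranking: D, B, C, A
]
true_labels = ["A", "B", "A"]

top1 = top_k_accuracy(distance_rows, true_labels, class_labels, k=1)
top3 = top_k_accuracy(distance_rows, true_labels, class_labels, k=3)
```

For a classifier that outputs prediction scores, the same function can be
reused by passing the negated scores, so that the highest score ranks first.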


Subsequently, we apply the trained Siamese network to the test set of dataset
B without retraining. Before we evaluate the performance of the classifier for
dataset B, we retrain the network using the two previously described tactics
to evaluate the transfer learning approach. Finally, we rate the ability of the
Siamese networks to transfer between datasets in comparison to the classifier's
performance with full retraining or transfer learning as an alternative.


4 Results 


After training the classification and Siamese network using the training and
validation XRD scans, we evaluate the Top-1 and Top-3 accuracy for the 1500
diffraction patterns of the test set. Table 1 presents the performance scores for
both networks. Overall, we achieve near perfect results for the classifier with

26 Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020 


Table 1: Comparison of Top-1 and Top-3 performance between the VGG16-like
classifier and Siamese network for dataset A.


            Network Type
Accuracy    Classifier    Siamese
Top-1       99.9%         97%
Top-3       100%          99.9%


a 99.9% Top-1 accuracy and can be sure that the correct phase is within the
three highest predicted classes (100% Top-3 accuracy). The performance of
the Siamese network is slightly worse than that of the classifier, with scores of 97%
and 99.9% Top-1 and Top-3 accuracy, respectively. The results of the Siamese
network suggest that there are materials with very similar diffraction patterns
in the dataset, which the Siamese network can hardly distinguish. Accordingly,
we obtain an almost perfect Top-3 accuracy, because the network only has to
rank the correct phase among the three closest references. In contrast, the
classification network is able to distinguish
the subtle differences by means of the dense layers, which means that there is
hardly any difference between Top-1 and Top-3 accuracy.


Next we evaluate dataset B, for which we can use the Siamese network without
having to retrain. The classification network is retrained using the full
retraining and transfer learning approaches, where the latter freezes the weights of the
convolutional layers during the training process and only fine-tunes the dense
layers. Table 2 shows the prediction scores for the two retraining strategies
of the classifier and the Siamese network without retraining. For the full
retraining approach of the classifier, we achieve almost identical accuracies
compared to dataset A, with 98.7% Top-1 and 99.9% Top-3 accuracy. The slightly worse
Top-1 accuracy indicates that the phases of dataset B are harder to distinguish
from each other. Notably, the Siamese network achieves an almost identical
performance for dataset B with a 99.3% Top-3 accuracy, although it was trained
with phases of dataset A only.


However, the transfer learning strategy is only partially suitable in the case
of transferring XRD datasets. Although the classifier still manages to predict
the correct phase in about 70% of all cases, it is significantly worse than the
full retraining approach. This leads to the conclusion that the weights of the
convolutional layers were strongly specialized to emphasize the peculiarities
of dataset A. Consequently, in the transfer learning approach the network is no
longer able to highlight the differences between the phases of dataset B in a
way that the dense layers can use for the classification.


Table 2: Comparison of Top-1 and Top-3 performance between the VGG16-like
classifier and Siamese network for dataset B.


            Network Type
            Classifier                              Siamese
Accuracy    Full Retraining    Transfer Learning
Top-1       98.7%              70.4%                93.5%
Top-3       99.9%              710.7%               99.3%


Furthermore, the classifier requires more time to be applied to dataset B,
regardless of the retraining strategy. Firstly, 2500 synthetic training scans (+
1500 validation scans to avoid overfitting) must be generated so that the
network learns the classification rules for the materials of dataset B. In
comparison, the Siamese network requires only one reference per material, so 100
synthetic scans. Secondly, the retraining process for the classifier takes some
time¹, while the Siamese network is applied to the second dataset without
retraining.


Overall, the Siamese networks prove that they are well suited for use with
XRD data and can also be used for materials that were not represented in the
training dataset. Since the diffraction patterns of identical materials can differ
significantly, as shown in Figure 2b, it is of interest to know how the Siamese
network processes the XRD scans to achieve the small distance between the
embedding vectors of identical materials. Hence, we visualize two embedding
vectors for XRD scans of the same material that differ in peak position, shape
and intensity in Figure 5. Interestingly, the Euclidean distance between the two
vectors is clearly not zero, as there are visible differences between the network
outputs. The output of the network resembles the input XRD scans, which,
however, are reduced in resolution by the max pooling layers. This
presumably enables the network to compensate for the possible shifts of the
diffraction angles, although the calculated peaks are not completely aligned.


Figure 5: Embedding vector of the Siamese network for two XRD scans of the identical material.


¹ The duration of the training process is strongly dependent on the utilized computing power.
Since the Siamese networks do not need to be retrained, we refrain from comparing the
absolute training times.


5 Conclusion and Outlook 


In summary, we employ Siamese neural networks for the first time with
one-dimensional XRD data to distinguish between materials. To this end, we trained
a reference classification network on two synthetic datasets and showed that
the Siamese network achieves almost identical results. The strength of the
Siamese network lies in the fact that it can be applied to materials not included
in the training dataset without the need for retraining of the network (in
contrast to the classification network). Accordingly, we demonstrate that
Siamese networks are particularly suitable for use in the field of materials
science, where the classes are dynamically composed. The comparable transfer
learning approach has not proven to be adequate in our case because the
features learned in the classifier were too specific for transferability.


Lastly, we showed how the neural network computes the embedding vectors,
and that the Siamese network is not yet able to fully compensate for the
differences within diffraction patterns of the same material. Therefore, it has to
be investigated whether the distance calculation using the pairwise Euclidean
distance for the one-dimensional data can be replaced by a more suitable
alternative. In the field of time series analysis, for example, there is a distance
calculation using so-called Dynamic Time Warping, which compensates for
small displacements on the time axis. It is possible that such a function enables
the network to compensate for the differences even better and thus to differentiate
the materials even more clearly.
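

To illustrate why Dynamic Time Warping might help, the following sketch
compares a toy peak pattern with a copy shifted by one position. It is a
textbook dynamic-programming formulation with an absolute-difference cost and
without a warping window, chosen here as an assumption; it has not been
integrated into or evaluated with our network.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy "diffraction peak" and the same peak shifted by one position.
pattern = [0, 0, 1, 5, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 5, 1, 0, 0]

pointwise = sum(abs(x - y) for x, y in zip(pattern, shifted))
warped = dtw_distance(pattern, shifted)
```

The pointwise comparison penalizes the shift heavily, whereas the warped
alignment matches the two peaks onto each other. Note that unconstrained
warping could also hide shifts that are physically meaningful, so a bounded
warping window would probably be needed in practice.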


The next step is to validate our results with measured XRD scans. While it is 
necessary that as many real possible variations as possible are represented in 
the training data and hence, synthetically generated data are possibly essential 
for the training of a Siamese network, the trained deep learning model should 
be able to use measured XRD scans as a reference input. 


Moreover, we have trained Siamese networks using XRD data as an example, 
but a deployment with other one-dimensional data would also be imaginable. 
Especially for similar analytical methods like NMR, for which a classifier 
based on a neural network has already been developed, a transfer of our results 
would be conceivable. Accordingly, the next step is to test our Siamese network 
approach with other one-dimensional data. 
Abstract


Electronic control units (ECUs) are essential for many automobile components, 
e.g. engine, anti-lock braking system (ABS), steering and airbags. For some 
products, the 3D pose of each single ECU needs to be determined during 
series production. Deep learning approaches cannot easily be applied to this
problem, because labeled training data is not available in sufficient numbers. 
Thus, we train state-of-the-art artificial neural networks (ANNs) on purely 
synthetic training data, which is automatically created from a single CAD file. 
By randomizing parameters during rendering of training images, we enable 
inference on RGB images of a real sample part. In contrast to classic image 
processing approaches, this data-driven approach poses only few requirements 
regarding the measurement setup and transfers to related use cases with little 
development effort. 
1 Introduction


An exemplary use case for our approach is the 3D pose estimation of electronic
control units (ECUs). The pose of each individual ECU needs to be detected
robustly to enable an automated application of sealing materials. Generally,
automated image processing is widely used in industrial series production [1,
2]. A commonly used method to detect ECU poses is classic image
processing. In some cases, these algorithms rely on fiducial marks imprinted on
the parts themselves. Setting up this image processing pipeline needs to be done by
experts individually for each new product. Substituting classic image processing
by fully automatically designed deep learning on our task can potentially
save a significant amount of development effort for new product designs.
Furthermore, ANNs generally impose much lower requirements regarding camera
resolution and surrounding conditions, enabling simpler measurement setups.
Adding fiducial markers in industrial applications “may be undesirable” [10].
Dropping the need for fiducial markers avoids changes to the product design,
which would involve product engineers. However, deep state-of-the-art ANN
architectures require a large number of labeled images, which are usually not
available for ECUs. In contrast to real-world images, rendered CAD images are
a widely available data source in industrial settings. A network trained solely
on CAD data cannot be directly applied to real images though, since those
differ with respect to pixel color values (see [3]). This domain gap is
generally present in many different settings that involve ANNs. Techniques
to overcome this domain gap are called domain adaptation and are subject
to current research (e.g. [4, 3]). Instead of adapting the ANN to a different
domain, an approach can also include adaptation of the domain itself: Images
from the training domain can be randomized to such an extent that the
real-world domain is "just another variation" [5]. This method is called domain
randomization and has been tested successfully on other use cases such as
indoor drone flight [6] or robotic grasping and manipulation [5, 7, 8, 9, 10].
Sundermeyer et al. work on pose estimation using a denoising
autoencoder architecture [11, 12]. In contrast to Sundermeyer et al., we
use state-of-the-art ANN architectures and evaluate different randomization
parameters. Tremblay et al., Khirodkar et al. and Hinterstoisser et al. use
domain randomization for object detection [13, 14, 15]. Khirodkar et al. focus
on the use case of detecting cars and also include a pose estimation. Domain
randomization is not limited to image data, however. For example, Peng et al.
have randomized the dynamic properties of their simulation model to transfer
a robotic control algorithm trained by deep reinforcement learning to the real
world [16].


In contrast to our setup, pose estimation approaches like BB8 [17], SSD-6D
[18] or PoseCNN [19] employ further pose refinement to improve accuracy
[20, 21]. Tekin et al. [20] and Do et al. [21] use standardized datasets, which
do not address the domain gap that is widely present in real-world use cases.
Kleeberger et al. also use domain randomization for pose estimation, but use
depth data instead of RGB images [22].


The generally very promising results on related use cases motivate the
deployment of deep ANNs for pose estimation of ECUs. In Section 2 we outline
our approach in a general way. This description is given on an abstract level
without implementation details. It serves as a template, which can easily be
applied to similar use cases. Subsequently, the experimental setup described
in Section 3 includes the details. In contrast to the previous section, we set out
specific implementation aspects, for example a detailed overview of the
parameters that we randomize and the way we set up the training datasets. Results
for our use case are presented in Section 4. Performance during inference is
listed for the different training datasets. We include an estimation of how much
errors differ from their mean values. In Section 5 we discuss the results. We
analyze the effects of different randomizations of training data. Furthermore,
we compare this data-driven approach against classic image processing setups,
with a focus on the possible impact on future manufacturing setups. Finally,
in Section 6 the most relevant aspects are summarized and a detailed outlook
onto further research opportunities is given.


2 Methods 


Figure 1 gives an overview of our approach. To cope with the domain gap
between simulated and real-world images we use domain randomization. Unlike
domain adaptation methods, domain randomization does not adapt the ANN
to a different domain. Instead, it adapts the training domain itself. Our
training domain consists of CAD images, which are rendered with arbitrary
settings. We generate different datasets of synthetic images automatically. We
train state-of-the-art ANNs on synthetic datasets and perform inference on
real-world images.


Figure 1: Overview of our approach


2.1 Domain randomization 


We created a generic pipeline for applying domain randomization as depicted
in the upper left area of Figure 1 (our specific implementation details are
outlined in Section 3). CAD files are loaded into a 3D rendering software suite. A
common exchange format is used for transferring CAD files. Geometric features
are not changed in any way. Within the 3D renderer, several randomizations
are applied. Specifically, we randomize shadows, translations and gray tones.
These are all parameters that are not causally connected to the labels. For
example, gray tones have no causal connection to rotation angles and must not
affect inference. The pose of the model is set to a random angle configuration.
An arbitrary number of angle configurations with individual randomizations
is rendered and output to image files. Randomization parameters and rotation
angles (= labels) are set via a scripting interface. Datasets with different
randomization parameters can be created in any desired number automatically. This
kind of domain randomization is used to close the domain gap from training on
CAD data to inference on real-world data (CAD2Real), removing the need for any
labelled real-world images. Once finalized, this setup can be used for different
products with minimum effort by replacing the CAD model and rerunning the
script. Images for our baseline dataset CAD_unchanged are created with this
method as well by simply omitting the randomizations.


2.2 ANN inference on real-world images 


Real-world images are automatically preprocessed and then fed into the trained
ANN. As shown in the lower part of Figure 1, the ANN infers all three rotation
angles directly from real-world RGB images. We use a fully automated
preprocessing step to replace the background with a uniform gray tone.


3 Experimental Setup 


We described our approach generically in Section 2. Here, we provide specific 
implementation details. We outline the creation of our different datasets used 
for training and during inference. Details regarding the ANN are also provided 
below. 


3.1 Datasets 


We work with two general types of datasets. On the one hand we have different
training datasets. Those are solely based on CAD data and generated
automatically. On the other hand we use experimental data for testing purposes.
This experimental data is captured from real product samples in a laboratory
setting.


Table 1: Parameters and their respective ranges


Parameter              Range
x-/y-angle             [-15°, 15°]
z-angle                [-45°, 45°]
Part translations      [-1.5 cm, 1.5 cm]
Camera translations    [-1.5 cm, 1.5 cm]
Gray tones             [0.05, 0.8]
Light positions        x ∈ [-1 m, 1 m]
                       y ∈ [-1 m, 1 m]
                       z = f(x, y)


3.1.1 Training sets 


For the creation of our training datasets we use the 3D rendering software Blender.
Blender is open-source software and freely available. It includes a powerful
Python API that enables control via automated scripts. The image generation
pipeline is generally depicted in Figure 1. We import the CAD model as a STEP
file, a data format commonly used for CAD data exchange. The geometry
itself is not modified. Rotations are applied around the x-, y- and z-axis within
a range of -15 to 15 degrees for the x- and y-axis and a range of -45 to 45
degrees for the z-axis. The rotation angles serve as labels and are therefore
stored for each image created. Random translations of the model along the x-
and y-axis are optionally applied within a range of -1.5 cm to 1.5 cm. Random
translations of the camera along the z-axis are optionally applied within a range
of -1.5 cm to 1.5 cm. One light source is placed high above the model, emitting
uniform lighting. Two additional light sources are optionally included as well.
Those are placed randomly above the model, generating random shadows. The
color of the model itself is randomized in uniform tones of gray. We are aware
that the introduction of additional light sources also makes the part appear
brighter. Therefore, we try to limit this effect by keeping the distance of both
additional lights to the part at a constant value. For each randomly drawn x-
and y-coordinate the z-coordinate is calculated so that the distance to the part
is always equal, effectively placing both lights on an imaginary sphere. All
parameters with respective ranges are listed in Table 1.
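

The sampling of one randomized configuration can be sketched in plain Python,
independently of the rendering software. The ranges follow Table 1. The
uniform sampling, the light distance of 2 m, the placement of the part at the
origin and all function and key names are assumptions made for this
illustration; the sketch does not call the Blender API.

```python
import math
import random

LIGHT_DISTANCE_M = 2.0  # assumed sphere radius; must exceed sqrt(2) m for the ranges below


def sample_light_position(rng, distance=LIGHT_DISTANCE_M):
    """Draw x and y in [-1 m, 1 m] and solve for z on a sphere around the part."""
    x = rng.uniform(-1.0, 1.0)
    y = rng.uniform(-1.0, 1.0)
    z = math.sqrt(distance ** 2 - x ** 2 - y ** 2)  # z = f(x, y), upper hemisphere
    return (x, y, z)


def sample_configuration(rng):
    """One randomized rendering configuration; the angles double as labels."""
    return {
        "angles_deg": (rng.uniform(-15, 15), rng.uniform(-15, 15), rng.uniform(-45, 45)),
        "part_translation_cm": (rng.uniform(-1.5, 1.5), rng.uniform(-1.5, 1.5)),
        "camera_translation_cm": rng.uniform(-1.5, 1.5),
        "gray_tone": rng.uniform(0.05, 0.8),
        "lights_m": [sample_light_position(rng) for _ in range(2)],
    }


rng = random.Random(42)
configs = [sample_configuration(rng) for _ in range(1000)]
```

Each dictionary would then be handed to the rendering script, which applies
the values to the scene and stores the angles together with the output image.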


We are using the training sets described below. During evaluation we compare 
our domain randomization approach against two datasets: 


• CAD_unchanged: Images are labeled with rotation angles around the
x-, y- and z-axis. There are no further modifications made.


• CAD_augmented: We take the images from the set CAD_unchanged and
apply random modifications. We modify brightness, translations and
zoom. This data augmentation is applied to already rendered images.
Data augmentation of this kind is usually used when not enough labeled
training data samples are available.


The following datasets include domain randomization. To test the influence of 
different parameters, we modify the extent of our randomization. Samples for 
each dataset are shown in Figure 2. 


• Dataset Rand_full includes all randomizations described above.


• Dataset Rand_noShadows is the same as Rand_full except for the two
additional light sources. Thus, no random shadows are included.


• Dataset Rand_noTranslations is the same as Rand_full except for the
translations of the part and the camera. Part and camera remain at
constant positions.


• Dataset Rand_noGraytones is the same as Rand_full except for
randomizing the uniform gray tones. Part color is only affected by the position
of both randomly placed lights.


For each dataset described here, we have created more than 100 000 training images.


Figure 2: Example images from training datasets


Figure 3: Example images from our real-world dataset with green background


3.1.2 Real-world dataset 


We outline the creation of our real-world dataset, which is used as test set
during inference. We use a single sample part of an ECU that is already
available during prototype phases (before starting series production). We use
the following laboratory setup: The images are recorded with a common SLR
camera, since there are no specific requirements regarding the camera. Image
resolution is later scaled down during preprocessing to below 300x300 pixels.
The camera is mounted onto a fixed frame with two lighting sources on either
side. The sample part is placed 0.5 m below the camera on a green cardboard
layer. To introduce rotations around the horizontal x- and y-axis we
use 3D-printed wedges. We use multiple wedges with slopes of 2.5, 5 and
10 degrees. Placing the wedges below our sample introduces the respective
rotations. We recorded 20 different angle configurations; examples are shown
in Figure 3.


3.2 Artificial Neural Network: Training and inference 


We use the state-of-the-art architecture InceptionV3 [23], with weights
pretrained on ImageNet. The architecture including its weights can be imported
within Keras [24] in two lines of code. To adjust to our number of labels we
append a dense layer with three neurons. Pose estimation is sometimes
implemented as classification, with a binning of rotation angle intervals. Binning
limits the resolution of angle values to discrete intervals. To avoid this, we
opt for regression instead, as proposed by e.g. Mahendran et al. [25].
We use a linear activation function within the last layer, directly outputting
continuous rotation angles around the x-, y- and z-axis. Each ANN is trained
on a dataset as described above. We train on each dataset with 10 000 randomly
drawn samples for 50 epochs. First runs have shown slightly differing results
when re-drawing the samples and re-starting training. Therefore, we execute
30 independent runs on each dataset. For final evaluation we use error metrics
as described below. The lower area of Figure 1 shows our setup for inference
with the trained ANN. We preprocess the real-world images to
remove the background. Since the images are taken on a green background,
preprocessing can be done automatically without much effort.
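

One simple way to realize such a background replacement is a per-pixel color
rule, sketched below on a tiny image given as nested lists of RGB tuples. The
dominance threshold, the minimum green value and the fill color are
assumptions picked for the example, and the function names are hypothetical;
this is not necessarily the rule used in our preprocessing.

```python
GRAY = (128, 128, 128)  # assumed uniform fill tone


def is_green(pixel, dominance=1.3, min_green=60):
    """Treat a pixel as background if green clearly dominates red and blue."""
    r, g, b = pixel
    return g > min_green and g > dominance * r and g > dominance * b


def replace_background(image, fill=GRAY):
    """Return a copy of the image with all green pixels set to the fill tone."""
    return [[fill if is_green(p) else p for p in row] for row in image]


# Toy 2x3 image: green cardboard pixels next to gray part pixels.
image = [
    [(20, 200, 30), (25, 190, 40), (180, 180, 185)],
    [(30, 210, 35), (90, 90, 95), (200, 200, 200)],
]
cleaned = replace_background(image)
```

A rule of this kind depends on the part itself containing no strongly green
regions, which is one reason for choosing a green cardboard layer.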


3.3 Evaluation metrics 


In this section we describe the metrics used for evaluation later on. First, we 
focus on the evaluation of the training set characteristics (to get an impression 
of “which training set is best”). We then also estimate the confidence of 
statements that are made on our limited number of real-world images. For 
a single test or validation image, an ANN outputs three distinct angle values
$\hat{y}_i, i \in \{1, 2, 3\}$. These are compared against the corresponding ground truth
angle values $y_i, i \in \{1, 2, 3\}$ by calculating the error


$$ e_i = \lVert \hat{y}_i - y_i \rVert_1. \qquad (1) $$


Each single ANN is evaluated on validation and test data. We merge all angle
values ($\hat{y}_i$ and $y_i$) into one vector per dataset ($\hat{y}$ and $y$). This is done on
validation and test data separately for each ANN. Errors are then averaged
over all angles and the respective number of images, yielding a mean error
per ANN of


$$ e_{ANN} = \frac{1}{3N} \lVert \hat{y} - y \rVert_1. \qquad (2) $$


N is the number of images in either the validation (N = 500) or test set (N = 20).
Since we assess the fitness of the training set characteristics, we calculate the
mean error per training set with


$$ \bar{e} = \frac{1}{M} \sum_{j=1}^{M} e_{ANN,j}, \qquad (3) $$


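with M = 30 ANNs for each training dataset. We further calculate the standard
deviation per training set


$$s = \sqrt{\frac{1}{M-1} \sum_{j=1}^{M} \left( \bar{e}_{ANN,j} - \bar{e} \right)^2} \qquad (4)$$


This yields a standard error of


$$s_{\bar{e}} = \frac{s}{\sqrt{M}} \qquad (5)$$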


We further calculate a margin of error for $\bar{e}$. Since our sample size is
limited, we use the Student's t-distribution instead of the normal
distribution. For our sample size of M = 30 and a confidence level of 99% we
calculate the margin of error for $\bar{e}$ as


$$MOE = \pm t_{M-1} \, s_{\bar{e}}, \qquad (6)$$


with $t_{M-1} = 2.76$. When working with a normal distribution instead,
$t_{M-1}$ would be replaced by $z_c$, with a value of 2.58 for the 99%
confidence level ($t_{M-1}$ converges towards $z_c$ for very large sample
sizes). The value of $t_{M-1}$ is retrieved from a table for the Student's
t-distribution (see e.g. [26], p. 206) and depends on the sample size and the
confidence level. For our sample size of M = 30, we need a t-value that
corresponds to M - 1 = 29 degrees of freedom. The values of the distribution
function in the table can be used directly when working with a one-sided
interval. Our two-sided interval gives an estimate in both directions (upper
and lower bound). We therefore read off t at a distribution function value of
0.995, which leaves 0.5% of the probability mass outside the interval on each
side.


Table 2: Mean error on validation data and real-world data

Training dataset        Validation data    Real-world data
                        mean error [°]     mean error [°]
CAD_unchanged           0.4 ± 0.04         11.7 ± 2.4
CAD_augmented           0.4 ± 0.05         2.5 ± 0.6
Rand_full               0.5 ± 0.06         1.5 ± 0.2
Rand_noTranslations     0.5 ± 0.07         7.7 ± 2.8
Rand_noShadows          0.5 ± 0.1          3.6 ± 0.8
Rand_noGraytones        0.4 ± 0.09         1.2 ± 0.2


Up to now, we have mainly looked at how results differ when re-drawing samples
from the training set and retraining the ANN. The limited number of images
within the test set is another factor that might affect our results. For our
limited number of real-world samples we take a similar approach as above, but
with a narrower scope. For the first ANN trained on dataset Rand_full, we
evaluate the mean error $\bar{e}_{test}$ for angles around the x-axis on the
test set containing 20 images. This is the same calculation as presented in
Equation (2), but discarding angles around the y- and z-axes. For those 20
error values $e_i$ we also calculate the standard deviation


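$$s_{test} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left( e_i - \bar{e}_{test} \right)^2} \qquad (7)$$


and the margin of error


$$MOE_{test} = \pm t_{N-1} \, \frac{s_{test}}{\sqrt{N}} \qquad (8)$$


with $t_{N-1} = 2.86$ for N = 20 and a 99% confidence level.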
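

The margin-of-error calculation of this section can be sketched in a few lines of Python. This is a minimal illustration and not the evaluation code used for the paper: the function name `margin_of_error` and the input values are hypothetical, and the t-values are simply the two quoted in the text for 29 and 19 degrees of freedom.

```python
import math
import statistics

# Two-sided 99% t-values quoted in the text, keyed by degrees of freedom.
T_99 = {29: 2.76, 19: 2.86}

def margin_of_error(values, t_value):
    """Return (mean, margin of error) for a list of error values."""
    m = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)           # sample standard deviation (divides by m - 1)
    standard_error = s / math.sqrt(m)
    return mean, t_value * standard_error

# Hypothetical per-ANN mean errors in degrees (illustrative values, not measured data).
errors = [1.4, 1.6, 1.5, 1.7, 1.3] * 6     # 30 values, one per ANN
mean, moe = margin_of_error(errors, T_99[len(errors) - 1])
print(f"{mean:.2f} +/- {moe:.2f} degrees")
```

The same function covers the single-ANN case on the 20 test images by passing the per-image errors together with `T_99[19]`.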


4 Results 


We train 30 ANNs on each of the datasets described in Section 3.1. We evaluate
the mean absolute error and include a margin of error as described in Section
3.3 for test and validation data on each dataset. Results are shown in Table 2.
The leftmost column indicates the dataset used during training. We then
evaluate the performance on validation data and real-world data. Validation
images are from the same domain as the images used during training. Real-world
data is taken from product samples and is therefore substantially different.
This is our target domain used for testing.


ANNs trained on CAD_unchanged exhibit the lowest error on the validation set,
but insufficient performance on real-world test images. For CAD_augmented the
mean absolute error on real-world images is lower by a factor of approximately
five (2.5° versus 11.7°). A further improvement by a factor of almost two over
CAD_augmented is gained by using Rand_full, which has full randomization
applied: angles around all axes are inferred with a mean error of 1.5 degrees.
For Rand_noTranslations and Rand_noShadows we note an error in between
CAD_unchanged and CAD_augmented. Dropping the randomization of translations
degrades performance more than dropping the randomization of shadows. Training
on Rand_noGraytones gives slightly better results than on Rand_full, but only
by a small margin. Errors on the validation set increase from CAD_unchanged to
CAD_augmented and rise further for the randomized datasets. All randomized
datasets show similar errors on validation data.


We now take an isolated look at the limited number of real-world samples as
described in Section 3.3. We evaluate the error for rotations around the x-axis
only and look at a single ANN trained on Rand_full. For our 20 real-world
samples the ANN inference has a mean error of 1.6 ± 0.4 degrees. This margin of
error is calculated for a confidence level of 99%. Our experimental setup as
described in Section 3.1 has measurement errors which affect the ground truth
labels. All results presented above are therefore subject to these measurement
tolerances.


5 Discussion 


The results presented in Section 4 show a consistent advantage of training on
randomized datasets for our application of pose estimation of ECUs. Later in
this section we also discuss the impact of this research direction on image
processing setups in related product applications. First, however, we look more
closely at the effects of how the datasets were set up.


The insufficient performance with the dataset CAD_unchanged is not surprising.
In this case, the training images differ considerably from the real-world
images. This can be interpreted as a “wide” domain gap, leading to poor
transferability from source domain to target domain. A significantly improved
performance on real-world images is achieved by applying state-of-the-art data
augmentation to the training set. Data augmentation is commonly used to expand
the training set size. This is especially useful when dealing with a limited
amount of labeled training data. We believe that there is a second benefit of
data augmentation. Augmenting training data with changing brightness or
translations also increases the diversity within the training set. Increased
diversity of features not relevant for inference favors transferability from
source to target domain. This effect is exactly the underlying idea of domain
randomization. Data augmentation can therefore be seen as a “light version” of
domain randomization.


With full domain randomization applied, inference quality is further increased
by another factor of approximately two. In comparison to data augmentation,
domain randomization introduces even more diversity to the training set. This
time, the introduced diversity goes beyond simply adjusting images.
Modifications of this kind cannot easily be applied to raw images. This is
especially clear for the randomization of shadows. Calculating the position and
intensity of shadows is an integral part of rendering and not easily possible
when working on two-dimensional image data only. Likewise, translations made
within the renderer lead to different outcomes than translations applied during
augmentation. Translations applied during state-of-the-art data augmentation do
not change the camera perspective. In contrast, inside the renderer not only
the part position changes, but also the perspective view of the part.
Translations in data augmentation are therefore different from those in domain
randomization. However, we believe that the major error reduction is achieved
simply by the fact that additional translations are introduced, regardless of
whether the perspective changes. The poor performance with the dataset
Rand_noTranslations in particular motivates the introduction of translations.


To get a better impression of the different aspects of domain randomization, in
addition to dropping the added translations in Rand_noTranslations we dropped
the shadow randomization in Rand_noShadows and the gray tone randomization in
Rand_noGraytones. The performance with Rand_noShadows is worse than with
Rand_full and CAD_augmented. It seems that shadows and translations are both
relevant factors when dealing with the present domain gap. However, dropping
the randomization of gray tones in Rand_noGraytones did not deteriorate
performance, but even shows a slight improvement compared to Rand_full. Since
we trained 30 different ANNs, we do not believe this effect is caused by
chance. A possible explanation is that randomization places many gray tones
outside a usable range (e.g. too light or too dark). A smaller part of the
training set is then useful for inference, since some images are “too far off”.
We also note that the effects of introducing random shadows and random gray
tones overlap to some extent. Both affect the color of the part at a certain
position and are not entirely independent of each other. In our opinion, the
unexpected behavior on the dataset Rand_noGraytones does not undermine the idea
of domain randomization in general. It rather motivates further studies on the
individual effects of different randomization types. Instead of randomizing
gray tones only, textures could be introduced as well, for example.


The varying error on the validation set from CAD datasets to randomized
datasets also motivates the analysis of different hyperparameters, mainly
training set size and the number of epochs. The most efficient hyperparameters
are likely to differ depending on the randomization type and extent. With
domain randomization we aim to cover all aspects during rendering that make the
real-world images different from the CAD images. This does not mean
representing reality exactly in simulation, however. For example, when dealing
with differing textures of the part in the real world, applying textures with
random noise in the simulation might be sufficient.


The successful application of domain randomization to this use case shows high
potential for future setups of image processing pipelines in series production
of ECUs. The pipeline that we used can easily be transferred to other products.
Other products may be manufactured on different production lines. Subsequent
processing of pose information varies between products or production lines. To
standardize e.g. the naming of parameters on these interfaces and during
further processing steps, an ontology-based approach is useful. Zhou et al.
[27] and Svetashova et al. [28] have successfully applied ontologies to other
production processes. This approach not only helps during technical setup, but
also enables a common understanding of process-specific details among all
persons involved [27, 28].


In contrast to many algorithms of classic image processing, our pose estimation
approach is not bound to specific product features. Further work aimed at
improving inference accuracy can also maintain this product-independent aspect.
This advantage is based on the fact that the features needed for inference are
learned by the ANN during the training process and not manually tuned. This
data-driven approach has the advantage that for a different problem setting
only the problem-specific training data needs to be supplied. We use only data
sources that are available without major effort. With our approach, the
problem-specific training data can be created automatically from a single CAD
model file. Once again we emphasize that training is done on purely synthetic
data. Not a single real-world image is needed during training. We see a high
potential for a significant reduction of development effort in future image
processing setups.


6 Summary and Outlook 


We set out to analyze the applicability of domain randomization to our use case
of pose estimation of ECUs. Our goal was to minimize the domain gap and to
deploy an ANN trained solely on synthetic data on real-world images. We have
shown that domain randomization reduces the error on real-world images by a
factor of around two compared to data augmentation. The mean error for inferred
rotations around all three axes is only 1.2 degrees on real-world images. The
entire pipeline for creating randomized training datasets and training the ANN
is fully automated. The only input needed for the creation of all training data
is a single CAD model file, which is readily available for all ECUs. We use
only a state-of-the-art ANN architecture with a minor adjustment regarding the
output dimensionality. Training is done end-to-end: we infer all rotation
angles directly from RGB images, and no depth data is needed. We have analyzed
the application to our use case and motivated further research directions. The
following aspects could make this approach fit for application in the
production of ECUs:


• We focused on detecting part rotations. For application in series
production, a post-processing step to determine translational degrees of
freedom needs to be appended. Including the translations directly in the
labels might also be a feasible approach for our use case (directly
inferring 6D pose information).


• Our pre-processing currently requires a green-colored background.
Randomizing the backgrounds as done in other use cases [5, 10, 13, 11]
could make our approach feasible for a background containing workpiece
carriers. This would remove the need for any pre-processing at all.


• We have provided insights into the effect of different randomizations. To
further improve accuracy, these influences need to be examined in more
detail. Ideally, this includes a simultaneous adjustment of the
hyperparameters for training the ANN.


The successful execution of the steps outlined above can reduce the entire 
pipeline for 6D pose estimation to solely a state-of-the-art ANN architecture. 
These architectures are conveniently available within the Keras library. This 
would provide a fully automated pipeline for pose estimation of new ECUs 
and similar products.


Abstract


This paper describes the application and effects of different balancing methods
on the learning behaviour and quality of a deep convolutional neural network
(DCNN) using acoustic data. The aim is to show to what extent these methods
have positive or negative effects for the audio data use case. The evaluation
is based on synthetic multi-class audio data, because a superposition of
effects should be avoided. This serves as preliminary work for applying the
methodology in later investigations to measurement data for the classification
of knife sharpness in forage harvesters. The data are presented to the DCNN
after each balancing method has been applied. Performance and quality are
measured by formal qualification criteria. It turned out that SMOTE gives the
best and most robust results and shows better convergence than the other
methods. The worst results are produced with untreated raw data.


1 Introduction


Using audio data as a training set for neural networks covers a wide range of
use cases. The use case addressed in this paper is classification. In
supervised learning this means that given class data is learned and the
resulting classifier can then independently assign presented data to classes
[1]. Real use cases usually involve more than two classes; these are called
multi-class problems. In addition, data sets acquired by measurement often show
unequally distributed class representations. Such data sets are called
unbalanced. The problem lies in the under-represented data, which can lead to a
shift in recognition accuracy towards the over-represented data. Various
methods exist for dealing with unbalanced data. The influence of balancing
methods has already been investigated on different data and learning structures
[2]. These investigations indicate that balancing probably influences learning
behaviour. The actual goal is therefore to find an optimal balancing method for
the real use case of acoustic data from forage harvesters. This is also a
classification problem: conclusions about the sharpness of the knives are to be
drawn from the structure-borne noise of the cutting drum. These audio data are
measured values. In order to avoid an accumulation of effects in the
evaluation, the effects are investigated in this paper using synthetic data.
This creates a basis which can be applied to the measured data in subsequent
work.


The basis of the paper is synthetic data from a self-designed generator. This
generator works on the principle of a recursive filter, more precisely the
autocorrelation filter. The filter has the order N = 6. In total, four classes
are generated. For each class there is a corresponding representative, from
which all data of that class is generated. The representative is audio data
with a sampling rate $f_A = 44.1$ kHz. The number of patterns per class is
deliberately chosen to be very unbalanced and follows the distribution in
Table 1.


Table 1: Class distribution of the synthetic data set

class    samples
1        350
2        2500
3        4000
4        9500


Most articles deal with two-class problems, whereas this paper already uses a
four-class problem, as in the later use case of forage harvesters. The uneven
distribution is similar to the first real measurement series, in which the
sharpest knife condition (class 1) is strongly under-represented and the
dullest knife condition (class 4) is the most frequent.


2 Methods 


2.1 Balancing Methods 


The balancing procedures represent the methodological variation of the study.
The conventional methods can be divided into five large groups [3]. The first
group comprises the Non-Heuristic Sampling Methods. These work randomly and
either delete or copy data to arrive at a certain number of samples. They
include Random Over-Sampling and Random Under-Sampling. Random Over-Sampling
copies randomly selected data from the under-represented classes until their
sample counts match the most common class. Random Under-Sampling works the
other way round: it randomly deletes data samples until the sample counts match
the least frequent class [4].
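

Both non-heuristic methods can be illustrated with the class sizes of Table 1. The following Python sketch is only an illustration under stated assumptions: the function `random_resample`, its `mode` argument and the integer placeholder samples are hypothetical and not taken from the study.

```python
import random

def random_resample(samples_by_class, mode, seed=0):
    """Balance classes by random copying ("over") or random deletion ("under")."""
    rng = random.Random(seed)
    sizes = [len(samples) for samples in samples_by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    balanced = {}
    for label, samples in samples_by_class.items():
        if len(samples) < target:
            # Over-sampling: append random copies, drawn with replacement.
            extra = rng.choices(samples, k=target - len(samples))
            balanced[label] = list(samples) + extra
        else:
            # Under-sampling: keep a random subset, drawn without replacement.
            balanced[label] = rng.sample(samples, target)
    return balanced

# Class sizes from Table 1; the integers stand in for audio samples.
class_sizes = {1: 350, 2: 2500, 3: 4000, 4: 9500}
data = {label: list(range(n)) for label, n in class_sizes.items()}
over = random_resample(data, "over")     # every class grows to 9500 samples
under = random_resample(data, "under")   # every class shrinks to 350 samples
```

Over-sampling enlarges the training set to four times the largest class, while under-sampling reduces it to four times the smallest class, which is one reason why the training times of the two methods differ.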


The second group uses Synthetic Data Generation. These methods also bring the
sample counts of all classes in line with a particular minority or majority
class. The Synthetic Minority Over-Sampling Technique (SMOTE) and Adaptive
Synthetic Sampling (ADASYN) are used for this. Both are over-sampling methods.
Synthetic Data Generation works in the feature space rather than in the data
space, where the Non-Heuristic Methods operate. The SMOTE algorithm
over-samples the minority class by forming synthetic samples along the line
segments that join a minority sample to its k nearest minority-class neighbors
[5]. The neighbor used for a given synthetic sample is chosen randomly from
these k. Depending on the amount of over-sampling, a sample is generated in the
direction of the chosen nearest neighbor. In vector terms, a difference vector
is formed, multiplied by a random number $\delta \in [0, 1]$ and added to the
original vector. Consider a training set S with a minority class P and other
classes N. For each sample $p_i \in P$ there is a nearest neighbor x chosen
from its k nearest neighbors. A new vector can now be defined as described
above:


$$x_{new} = p_i + (x - p_i) \times \delta \qquad (1)$$


This selects a point on the segment between a sample and its nearest neighbor
[6].
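

The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration, assuming small tuples as feature vectors and a brute-force neighbour search; the helper `smote_sample` is hypothetical and is not the implementation used in the study.

```python
import math
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point: x_new = p + (x - p) * delta."""
    p = rng.choice(minority)
    # k nearest minority neighbours of p by Euclidean distance (p itself excluded).
    others = [q for q in minority if q is not p]
    neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
    x = rng.choice(neighbours)       # neighbour picked at random among the k nearest
    delta = rng.random()             # random number in [0, 1)
    return tuple(p_c + (x_c - p_c) * delta for p_c, x_c in zip(p, x))

rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # toy feature vectors
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(100)]
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data never leaves the convex hull of the minority class.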


ADASYN is motivated by SMOTE and aims to reduce bias while learning adaptively.
Consider the training data set S with m samples $\{x_i, y_i\}, i = 1, \dots, m$,
where $x_i$ denotes a sample and $y_i$ its label. Let $m_s$ be the number of
minority class samples and $m_l$ the number of majority class samples. The
degree of imbalance is then


$$d = m_s / m_l \qquad (2)$$


with $d \in (0, 1]$. The number of synthetic samples to be generated is


$$G = (m_l - m_s) \times \beta \qquad (3)$$


The parameter $\beta$ indicates the degree of balancing and is set to 1 in our
case. Similar to SMOTE, the K nearest neighbours of each minority sample are
then determined using the Euclidean distance in n dimensions, and the ratio


$$r_i = \Delta_i / K, \quad i = 1, \dots, m_s \qquad (4)$$


is calculated, where $\Delta_i$ is the number of samples among the K nearest
neighbours of $x_i$ that belong to the majority class. The ratio is then
normalized:


$$\hat{r}_i = r_i \Big/ \sum_{i=1}^{m_s} r_i \qquad (5)$$


The number of synthetic data samples to be generated per minority sample is


$$g_i = \hat{r}_i \times G \qquad (6)$$


where G is the total number of synthetic samples from Equation (3). The
corresponding synthetic samples for each $x_i$ are then generated with the rule


$$s_i = x_i + (x_{zi} - x_i) \times \delta \qquad (7)$$


where $x_{zi}$ is a randomly chosen minority sample among the K nearest
neighbours of $x_i$ and $\delta \in [0, 1]$ is a random number. The main
difference to the SMOTE method is the use of a density distribution with
$\sum_i \hat{r}_i = 1$ as the criterion for where synthetic data must be
generated [3].
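

How ADASYN distributes the synthetic samples over the minority samples can be sketched in Python. The function `adasyn_counts` and its toy inputs are assumptions for illustration; rounding to whole samples is also an assumption of this sketch and is not specified in the text.

```python
def adasyn_counts(delta, k, m_l, beta=1.0):
    """Number of synthetic samples per minority sample.

    delta[i] is the number of majority-class samples among the k nearest
    neighbours of minority sample i, so len(delta) is the minority size m_s.
    """
    m_s = len(delta)
    g_total = (m_l - m_s) * beta                    # total number of samples to generate
    r = [d_i / k for d_i in delta]                  # share of majority neighbours
    r_sum = sum(r)
    if r_sum == 0:
        # No minority sample has majority neighbours; this sketch generates nothing.
        return [0] * m_s
    r_hat = [r_i / r_sum for r_i in r]              # normalized, sums to 1
    return [round(rh * g_total) for rh in r_hat]    # rounded to whole samples

# Toy example: 4 minority samples, 10 majority samples, k = 5 neighbours.
counts = adasyn_counts(delta=[0, 1, 2, 3], k=5, m_l=10)
```

Minority samples that are surrounded by many majority samples receive more synthetic neighbours, while a sample with no majority neighbours receives none. Because of the rounding, the counts do not always add up exactly to G.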


The third group uses Cost-Sensitive Learning. Here a cost factor for false
predictions is determined, which affects the gradient. The data itself is not
changed in this procedure. In this paper, the weighting factor is calculated
from the number of samples per class. We assume there are N classes and the
i-th class has $m_i$ training samples. A class weight $w_i, i = 1, \dots, N$
can then be determined, which affects each class according to its number of
samples:


$$w_i = \frac{\sum_{j=1}^{N} m_j}{m_i \cdot N} \qquad (8)$$
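

Applied to the class sizes of Table 1, the class weighting can be computed as in the following Python sketch. The function name `class_weights` is hypothetical; only the formula and the class sizes come from the text.

```python
def class_weights(counts):
    """Weight per class: total number of samples / (samples in class * number of classes)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (m * n_classes) for label, m in counts.items()}

# Class sizes from Table 1.
weights = class_weights({1: 350, 2: 2500, 3: 4000, 4: 9500})
for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")
```

The rarest class receives the largest weight, so that every class contributes the same total weight (a quarter of all samples) to the cost function.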


The last group is Active Learning. Here, learning algorithms are applied in
advance to balance the data. One application is the Cluster-Centroids method.
Depending on the minority class, clusters are formed in the point clouds of the
other classes. That is, for a learning data set N with m samples, $k > 2$
subsets have to be found whose members have similar attributes, while the
cluster centroids should differ from each other. With
$N = \{x_1, \dots, x_m\}$ a set of points is defined which fulfill a similarity
condition, where $x_i \in \mathbb{R}^d$ and $d > 1$. Now $\mu_j$ is defined as
the centroid of the cluster G(j), with $M = \{\mu_1, \dots, \mu_k\}$, and a set
of clusters $P = \{G(1), \dots, G(k)\}$ is introduced. This describes the
clustering problem, which is used for optimization:


$$P: \ \text{minimize} \ z(W, M) = \sum_{i=1}^{m} \sum_{j=1}^{k} w_{ij} \, d(x_i, \mu_j) \qquad (9)$$


Figure 1: Net summary of the deep convolutional neural network


With $w_{ij} \in \{0, 1\}$ it is defined whether $x_i$ belongs to G(j), and
$d(x_i, \mu_j)$ is the Euclidean distance. There are several methods to solve
P, which are not discussed in detail here. It should only be noted that
$k = m_s$ is set, i.e. the number of clusters equals the number of minority
class samples. As a consequence, the formation of cluster centroids counts as
an under-sampling procedure [7].
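

The cost z(W, M) of the clustering problem can be evaluated for a given set of centroids as in the sketch below. It assumes that each point is assigned to its nearest centroid, which minimizes z for fixed centroids; the function name `clustering_cost` and the toy points are hypothetical, and the search for good centroids itself is not shown.

```python
import math

def clustering_cost(points, centroids):
    """Sum of Euclidean distances from each point to its nearest centroid."""
    return sum(min(math.dist(x, mu) for mu in centroids) for x in points)

# Toy majority-class points forming two groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
cost_two = clustering_cost(points, [(0.0, 1.0), (10.0, 1.0)])   # one centroid per group
cost_one = clustering_cost(points, [(5.0, 1.0)])                # a single central centroid
```

After optimization, the majority-class points are replaced by the k centroids, which is why the method reduces the amount of training data.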


2.2 Deep Convolutional Neural Network 


A deep convolutional neural network (DCNN) was used for the investigation. Its
structure, shown in Figure 1, was not changed during the analysis. Because the
weights w of the DCNN are initialized stochastically, the network was
initialized several times, more precisely 100 times. This set Z of networks,
called individuals in the following, is generated per balancing procedure and
then evaluated. In this way the purely stochastic influence of the weight
initialization becomes visible in the results.


The first layer is the input layer of the net. It takes the audio samples of
length $n_s = 4410$, corresponding to $t_s = 0.1$ s. It is also the first
convolution layer, which is one-dimensional because it is applied to audio
vectors. It is configured with 50 filters, each with a window width of 500. The
activation function is the rectified linear unit:


$$f(x) = \max(0, x) \qquad (10)$$


The following dropout layer in conjunction with the max-pooling layer creates a
stochastic selection of the activated output neurons of the convolution layer.
It thus represents a multinomial distribution during training, with the effect
of avoiding overfitting. The dropout rate is 0.4 and the pooling rate is 4 [8].
A second convolution layer follows. It has the same filter width of 500, but
only 5 filters. The purpose of the second convolution layer is more abstract
filtering: the first layer should find features on a lower level of abstraction
and the second layer on a higher level. Dropout and pooling layers are added to
this layer to achieve the same effect as with the first convolution layer; here
the dropout rate is 0.2 and the pooling rate is again 4. The subsequent flatten
layer reduces the dimensionality and merely prepares the structure for the
subsequent dense layers. These are fully connected layers and are characterized
by precise learning behaviour. The first dense layer has 1000 units and uses
the hyperbolic tangent as activation function:


$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (11)$$


The following dropout layer, which also serves to prevent overfitting, has a
rate of 0.5. The second dense layer is smaller, with only 500 units, and is
likewise followed by a dropout layer with a rate of 0.5. The output layer uses
the softmax function:


$$o_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \qquad (12)$$


It is one of the most frequently used components and is also applied here [9].
The cost function is the categorical cross-entropy, minimized with the ADAM
optimizer:


$$\mathcal{L}_{CCE}(f(x), y) = -\sum y \log f(x) \qquad (13)$$
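

The softmax output and the categorical cross-entropy can be written out for a single sample as in the following Python sketch. It is a plain illustration of the two formulas, not the training code of the study; the function names and the example values are assumptions, and subtracting the maximum inside `softmax` is a common numerical safeguard that is not mentioned in the text.

```python
import math

def softmax(z):
    """Turn raw output values into class probabilities that sum to 1."""
    shift = max(z)                               # avoids overflow, result is unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(probs, one_hot):
    """Negative sum of y * log(p); only the true class contributes for one-hot labels."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

# Four classes as in the synthetic data set; equal raw outputs give a uniform prediction.
probs = softmax([0.0, 0.0, 0.0, 0.0])
loss = categorical_crossentropy(probs, [0, 0, 1, 0])
```

For a network that cannot yet distinguish the four classes, the loss equals log 4, which is a useful reference value when judging early training progress.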
2.3 Qualifying criteria


The previous section described the structure of the classifier, and the
theoretical basics of the sampling methods were discussed before that. The
classification performance is of central importance for this paper, as it is
the basis for evaluating the balancing methods. In most cases, scalar
parameters are determined for this purpose that statically count the correctly
assigned classes. However, this often does not yield meaningful evaluations.
With unevenly distributed classes, i.e. unbalanced data sets, the
over-represented classes may be preferred. Rarer classes of equal importance
would then hardly be recognized, while the classifier would still show a
comparatively high classification quality. A further problem can arise if the
relevance of a correct classification differs strongly between classes. This is
not the case with the data used here, but can arise when applying the same
methodology to real data. A third problem, which can occur with real data, is
uncertainty or noise in the expert specifications. For all measurement data
sets that originate from real applications, the actual measurement signal is
subject to error, so an expert's estimate of the target state of the data can
also be wrong. Due to the synthetic data generation this is not the case in
this paper.


Having illustrated these quality problems, we now present the calculation of
the quality parameters used. For this purpose, we first look at the feature
data with N classes. The class assigned to the i-th data sample by the
classifier is denoted $CL_i$, the class assigned by the expert $CL_i^{*}$. The
element $s_j$ of the vector $s = [s_1, \dots, s_N]^T$ gives the number of
objects of class j in the test set. The classification yields an element
$r_j \in [0; s_j]$ of the vector $r = [r_1, \dots, r_N]^T$, which represents
the output of the classification and contains the number of correctly
recognized objects of class j. This results in the simplest qualification
criterion, which is called $q_1$:


$$q_1 = \frac{\sum_{j=1}^{N} r_j}{\sum_{j=1}^{N} s_j} \qquad (14)$$


This includes the already mentioned problem of preferring the most common
class. As a remedy, a criterion is needed that also takes the rarely occurring
classes into account. For this purpose $q_2$ is introduced as follows:


$$q_2 = \frac{1}{N} \sum_{j=1}^{N} \frac{r_j}{s_j} \qquad (15)$$


By normalizing to the number of samples per class, all classes enter the
quality criterion with the same weight. One way to calculate a measure that is
sensitive to the worst recognized class is the geometric mean. It remains
independent of the number of objects in a class and is defined as follows:


$$q_3 = \sqrt[N]{\prod_{j=1}^{N} \frac{r_j}{s_j}} \qquad (16)$$


The quotient $r_j / s_j$ indicates how well class j is recognized. If attention
is turned to the worst recognized class overall, with


$$q_4 = \min_{j} \left( \frac{r_j}{s_j} \right) \qquad (17)$$


the robustness of the classifier can be assessed. However, the parameter $q_4$
is to be supplemented by a more robust measure. The classification can be
regarded as a nominal scale, so the kappa coefficient can be used to measure
the agreement of judgements. Its starting point is the confusion matrix of two
judges, the expert and the classifier. In the form presented below, the expert
specifications form the rows of the matrix and the judgements calculated by the
classifier form the columns. The elements of the confusion matrix K are
relative frequencies $k_{ij}$:


$$K = \begin{pmatrix} k_{11} & \dots & k_{1N} \\ \vdots & \ddots & \vdots \\ k_{N1} & \dots & k_{NN} \end{pmatrix} \qquad (18)$$


The absolute frequencies with which target and actual classification occur
together are counted as $ka_{ij}$. They are normalized by the number of test
data records T:


$$k_{ij} = \frac{ka_{ij}}{T} \qquad (19)$$


The relative frequency of concordant judgements is obtained along the main
diagonal of the matrix K:


$$p_0 = \sum_{i=1}^{N} k_{ii} \qquad (20)$$


This is compared with the proportion of expected concordant judgements, which
is given by the following relative frequency:


$$p_e = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} k_{ij} \right)^{2} \qquad (21)$$


This calculation differs from the original definition in that it uses the
squares of the row sums instead of the products of the row and column sums.
This is done to give the target classification an increased weight. The reason
is that the original kappa coefficient assumes two independent and equal
judges, which is not the case when classifying with a neural network. The kappa
coefficient itself is defined as follows:


$$\kappa = \frac{p_0 - p_e}{1 - p_e} \qquad (22)$$
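

The quality criteria can be computed from a matrix of absolute frequencies as in the following Python sketch. It is an illustration under stated assumptions: the function `quality_criteria` and the two-class example are hypothetical, $q_3$ is taken to be the geometric mean of the per-class recognition rates as described in the text, and the kappa coefficient uses the modified expected agreement with squared row sums.

```python
import math

def quality_criteria(counts):
    """counts[i][j]: number of test objects of expert class i assigned to class j."""
    n = len(counts)
    total = sum(sum(row) for row in counts)
    s = [sum(row) for row in counts]                  # objects per expert class
    r = [counts[j][j] for j in range(n)]              # correctly recognized per class
    rates = [r_j / s_j for r_j, s_j in zip(r, s)]
    k = [[c / total for c in row] for row in counts]  # relative frequencies
    p0 = sum(k[i][i] for i in range(n))
    pe = sum(sum(row) ** 2 for row in k)              # squared row sums (modified kappa)
    return {
        "q1": sum(r) / sum(s),
        "q2": sum(rates) / n,
        "q3": math.prod(rates) ** (1 / n),
        "q4": min(rates),
        "kappa": (p0 - pe) / (1 - pe),                # undefined if pe == 1 (a single class)
    }

# Unbalanced toy example: 10 objects of class 1 (5 recognized), 90 of class 2 (all recognized).
result = quality_criteria([[5, 5], [0, 90]])
```

In this example $q_1$ is 0.95 although only half of the rare class is recognized, while $q_2$, $q_3$ and $q_4$ drop to 0.75, about 0.71 and 0.5. This illustrates why $q_1$ alone is not meaningful for unbalanced data.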


Finally, the learning behaviour of neural networks must also satisfy
performance requirements. For this purpose a parameter $t_i$,
$i = 1, \dots, J$, is introduced. It is measured during the calculation and
specifies the computation time of the learning process until a convergence
criterion is reached. The background is that over-sampling or under-sampling
changes the amount of learning data, which should be taken into account when
comparing the balancing methods. The quality, robustness and performance of the
classifier are evaluated with the six criteria described above [10].


[Chart "Balancing comparison - BEST [t, q1, q2]"; x-axis: qualifying criteria; series: Raw Data, Random Under-Sampling, Random Over-Sampling, Classweighting, SMOTE, ADASYN, Cluster-Centroids]


Figure 2: Balancing comparison of the best runs with [t, q1, q2]


3 Results 


As already mentioned, training was done with a total of J = 100 runs on the
same data and with the same balancing procedure. The only difference between
these individuals lies in the stochastic initialization of their weights. The
best training results are presented first. For this purpose the parameters
t, q1, q2, q3, q4 and κ were used.


Figure 2 shows large differences in training time. SMOTE and ADASYN have the
longest training times, while Random Under-Sampling and Cluster-Centroids have
the shortest. This is due to the creation or removal of samples, which directly
influences the training time. The parameters q1 and q2 show that the maximum of
1 is reached in the best run, so there is no difference between the balancing
procedures.


The analysis of the parameters q3, q4 and κ does not show any differences
between the methods either, as figure 3 shows. Thus, all six listed balancing
methods, as well as the raw data, are in general able to converge towards an
optimal classification result. One reason for this result may be the stochastic
initialization with J = 100 runs, which makes it likely that at least one run
starts from a favourable position of the weights w for convergence.
Figure 3: Balancing comparison of the best runs with [q3, q4, κ]. Axes: value
over qualifying criteria; series: Raw Data, Random Under-Sampling, Random
Over-Sampling, Classweighting, SMOTE, ADASYN, Cluster-Centroids.


Figure 4: Balancing comparison of the averaged runs with [t, q1, q2]. Axes:
value over qualifying criteria; series: Raw Data, Random Under-Sampling, Random
Over-Sampling, Classweighting, SMOTE, ADASYN, Cluster-Centroids.


Representations of the best learning outcome of a set of individuals provide a
static measure of the ability to converge, whereas the statistical mean is a
quantity-based measure of how often the individuals move towards convergence.
Therefore, the parameters t, q1, q2, q3, q4 and κ are averaged as follows:


\[
\bar{x} = \frac{1}{|I|} \sum_{i \in I} x_i \tag{23}
\]


$I$ represents the set of individuals and $x$ is the formal representative of
all used parameters.
Figure 5: Balancing comparison of the averaged runs with [q3, q4, κ]. Series
per qualifying criterion: Raw Data, Random Under-Sampling, Random
Over-Sampling, Classweighting, SMOTE, ADASYN, Cluster-Centroids.


The parameter t in figure 4 reflects the behavior of the best runs. However,
Random Over-Sampling now joins the set of balancers that need a longer
calculation time. For the parameters q1 and q2 it is noticeable that
Classweighting performs worst and SMOTE, as a synthetic over-sampling method,
performs best. The difference between q1 and q2 for the raw data is
interesting. Since q1 favors the most represented class and q2 the less
represented class, this means that the classifier learns the over-represented
data better when trained on pure raw data.


The two parameters q3 and q4 evaluate the worst recognized class and thus make
a statement about the robustness of the classifier. Figure 5 makes it
immediately clear that SMOTE produces very good results compared to the other
methods and that the raw data provide a less robust classification. The results
for the κ coefficient are similar to those for the parameter q1: the class
weights give a relatively poor result, and SMOTE again produces the best
results.


The averaged coefficients provide a quantity-related representation. The
distributions of the parameter values also need to be analyzed, for which box
plots are used. The boundaries of the box in these plots represent quantiles,
more precisely $x_{0.25}$, $x_{0.5}$ and $x_{0.75}$. Usually a so-called
whisker is specified in connection with this. Its definition is not uniform;
here it is defined as follows:


\[
w = 1.5 \cdot (x_{0.75} - x_{0.25}) \tag{24}
\]


Figure 6: Boxplot of balancing methods regarding training time t


This whisker limits the values up and down. However, if the whisker falls 
below or exceeds the minimum and maximum values, these are assumed. This 
procedure was used in the following illustrations. 
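

The whisker rule of equation (24), including the clipping to the observed
minimum and maximum, can be sketched as follows. The helper name
`box_plot_limits`, the choice of attaching the whiskers to the quartiles, and
the interpolation method used for the quantiles are assumptions for
illustration; the original text does not specify them.

```python
import statistics


def box_plot_limits(values):
    """Return (lower whisker, x_0.25, x_0.5, x_0.75, upper whisker)."""
    # Quartiles; the "inclusive" interpolation method is an assumption.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    w = 1.5 * (q3 - q1)                   # Eq. (24): whisker length
    lower = max(min(values), q1 - w)      # clip to the observed minimum
    upper = min(max(values), q3 + w)      # clip to the observed maximum
    return lower, q1, median, q3, upper


# Hypothetical sample with one large value: the upper whisker stops at
# x_0.75 + w instead of reaching the maximum.
print(box_plot_limits([0, 1, 2, 3, 100]))
```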


The spread of the parameter t in figure 6 shows that the over-sampling
procedures have an increased median as well as larger minima and maxima. This
can be attributed to the increase in the number of samples. The under-sampling
methods, on the other hand, scatter less and have a lower median.
Classweighting offers a balanced middle ground.


The box plots of the parameter q1 in figure 7 differ from the conventional
appearance of box plots. First of all, it should be noted that this plot, like
the following box plots, is limited to [0; 1]. Furthermore, the scatter of all
balancing procedures is similar; only the raw data differ. It is interesting to
note that the median $x_{0.5}$ and the third quartile $x_{0.75}$ are very close
together, which indicates a high frequency of values at this point. In
addition, the median for the raw data is lower than for the balancing
procedures.


Figure 7: Boxplot of balancing methods regarding parameter q1


The parameter q2 in figure 8 behaves similarly to q1. However, since it favors
the underrepresented classes, there is more variation. The classification is
also worse when training with raw data, as the median shows.


Figure 9 shows q3 as a robustness measure. The scatter is very high for all
methods and covers the whole value range [0; 1]. For all sampling methods,
however, the median is close to 1, which indicates a high frequency of values
at this point. Only the raw data perform worse: their median is close to 0.5,
which indicates a balanced distribution.


Since q4 is also a measure for robustness, it can be assumed that a similar form 
of box plots as in figure 9 can be expected. This is confirmed with figure 10. 


The final parameter to be addressed is κ. In the previous plots this parameter
behaved like q1 with respect to the best run and the averaged runs. Figure 11
shows that all sampling procedures have a high degree of dispersion. For all of
them $x_{0.25}$ is close to 0.25, and the minimum values tend to be as low as
0. The median and the third quartile are close to 1, which again indicates a
high frequency of values at this point. The individuals trained on the raw data
are an exception. They scatter less, because their first quartile is around
0.7, but their median is also lower, at 0.84. This shows the clear influence of
the balancing procedures on the stochastic initialization of the weights for
the given dataset.


Figure 8: Boxplot of balancing methods regarding parameter q2


4 Conclusion 


In this paper the aim was to investigate the influence of different balancing 
methods on audio data. In order to isolate the effect, it was decided to use 
synthetic data as a basis. With this data a predefined structure of a deep 
convolutional neural network was trained. The only variant part of the network 
was the random initialization of the weights. 


Figures 2 and 3 allow statements concerning convergence. From a static point of
view, all balancing methods are able to converge and solve the problem. All
defined parameters reach the value 1 in [0; 1], except the calculation time,
which is proportional to the number of samples used in the learning process.
Figures 4 and 5 display the averaged values of the parameters. This is a
dynamic measure, since each run, i.e. each individual, converges differently
and the average takes all individuals into account. First, the calculation time
is again proportional to the number of samples in the balanced datasets.
Furthermore, when learning from pure raw data without balancing, the classifier
favors the overrepresented classes; conversely, it performs worse on the
underrepresented classes. On average, the best balancing method is the SMOTE
algorithm. It converges very well with regard to all quality parameters
q1, q2, q3, q4 and κ, whereas the other procedures tend to be less optimal. The
analysis of the dispersion of the quality parameters in figures 6 to 11 reveals
high variances for all balancing procedures. It shows that, in addition to
convergence, each algorithm also diverges in some runs due to an unfavourable
initialization of the weights. Likewise, the measure of dispersion itself is
similar for most balancers per parameter and differs only for the raw data. A
strong binary split between convergence and divergence emerges.


Figure 9: Boxplot of balancing methods regarding parameter q3


From the results it can be concluded that the best balancing method is SMOTE.
This, of course, holds only for the synthetic audio data used here. In future
work the effects on real measurement data have to be investigated. Since such
data contain noisy expert labels in addition to noise-superimposed signals, the
results remain to be seen. Another effect could also occur when considering
robustness as well as convergence.


Figure 10: Boxplot of balancing methods regarding parameter q4


Figure 11: Boxplot of balancing methods regarding parameter κ
Abstract


In recent years several new LiDAR datasets for object detection were
published. All these datasets were recorded with different LiDAR setups and at
different locations. KITTI, for example, has 64 channels and was recorded
in Germany, whereas Lyft (Level 5) has only 40 channels and was recorded
in the USA. This leads to different characteristics of the LiDAR point clouds.
In this paper, we present and evaluate a way to transform KITTI BEV maps
such that they look like Lyft BEV maps. For this transformation we use the
state-of-the-art image-to-image translator CycleGAN. The transformation is
evaluated by two strategies: Firstly, we test whether the translated KITTI BEV
maps work better for an object detector that is trained on Lyft. Secondly, we
test whether the characteristic structure of the Lyft dataset (number of
channels, location of points) is adopted by the translated point cloud. The
conducted experiments showed that after the translation the KITTI BEV maps are
more similar to Lyft BEV maps, but the detection got worse.
1 Introduction


Especially in autonomous driving, any method used to evaluate the 3D driving 
environment must be fail safe. LiDAR, RADAR and ultrasonic sensors are 
used to obtain 3D information about the surrounding area. While an ultrasonic 
sensor only works in short range and the 3D information of a RADAR is very 
sparse, the LiDAR sensor can produce accurate information of the immediate 
surroundings such as pedestrians or other vehicles. 


With the KITTI dataset [1], a dataset for autonomous driving containing LiDAR
point clouds has been publicly available since 2012 and is still used as a
standard dataset. In recent years, new datasets for autonomous driving
containing LiDAR point clouds became publicly available, e.g. nuScenes [2] in
2018, Lyft Level 5 [3] and Audi A2D2 [4] in 2019, and Ford AV [5] in 2020.
Astyx HiRes [6] was the first published dataset for autonomous driving that
contains point clouds from a high-resolution RADAR in addition to LiDAR point
clouds. As Table 1 shows, all datasets use different LiDAR setups. Significant
differences between the datasets are the number of channels and the vertical
resolution, which depends on the type of LiDAR that was used. While for
most datasets one LiDAR is mounted on top of the roof, multiple LiDARs were
used for the Audi dataset: one at the front of the roof and four at the corners
of the roof, each with a slight tilt. Due to the different locations of the
LiDARs, the point clouds of A2D2 look different from point clouds from a single
LiDAR. The point clouds in A2D2 have a grid pattern rather than the circular
pattern seen in the other datasets with one top LiDAR. The different LiDAR
setups of the datasets also lead to a different appearance of the LiDAR point
clouds (Fig. 1). Both KITTI and Lyft have a LiDAR mounted on top of the roof
and generate a 360 degree view of the surrounding area with multiple laser
channels. Because in KITTI many laser channels scan the area close to the ego
vehicle, a point cloud from the KITTI dataset contains many more points near
the ego vehicle than a point cloud from Lyft.


¹ While working on this paper only the data of the top LiDAR were available.
Table 1: Comparison of the LiDAR setups of different datasets. Specs marked with * are taken
from the datasheet of the LiDAR manufacturer and not from the source of the dataset.
T = Top; B = Bumper; F = Front; C = Corner.


                      KITTI         Lyft      Audi A2D2   Astyx
Number of LiDARs      1             1+2       5           1
Range [m]             120           -         100         100
Channels              64            40/64 ¹   16          16
Azimuthal FOV [°]     360           360       360         360
Vertical FOV [°]      26.8          -         30          30
Azimuth resol. [°]    0.08-0.35 *   0.2       0.1-0.4 *   0.1-0.4 *
Vertical resol. [°]   0.4 *         -         2           25
Rate [Hz]             10            10        10          10
Position              T             T+2B ¹    F+4C        T
Intensity             ✓             ✗         ✓           ✓
Type                  HDL-64E       -         VLP-16      VLP-16


These LiDAR datasets are used to train and test methods e.g. for 2D and 
3D object detection, segmentation, SLAM or optical flow. Wang et al. [7] 
showed that an object detector trained on one dataset (source) performs worse 
on another dataset (target). They concluded that a possible reason is the size 
of cars in different regions of the world. They applied different strategies 
that focus on the size of the cars: enlarge or shrink the bounding box and 
the corresponding point clouds in the training scenes (SN), continue training 
with some ground truth point clouds from the target dataset (FS) or enlarge or 
shrink the predicted boxes of the detector (OT). These three methods proved
to be effective in improving the performance of the detector. All of them are
based on changing the detection method rather than the target dataset.


For multiple methods, or for tasks other than object detection, it could be
difficult and time-consuming to modify all methods and retrain them. It would
be faster if we could transform the point clouds of the target dataset so that
they look like the source dataset on which we trained our methods. In this
paper we therefore focus on the different LiDAR setups used in the datasets.
This type of problem is called domain adaptation (Sec. 2.2). One part of domain
adaptation is unsupervised image-to-image translation. Methods like CycleGAN
[8] or UNIT [9] have shown promising results in the translation between images
of different domains, e.g. night to day, summer to winter or simulated to real
data. While these methods work well for 2D image domains as seen in [8, 9],
they are not designed for 3D point clouds such as LiDAR point clouds.
Approaches for 3D object detection like Simon et al. [10] convert the 3D point
clouds to Bird's-Eye-View RGB maps (BEV) and train a 2D object detector on this
converted data. They encoded additional information, like height, density and
intensity of the point clouds, into the RGB color channels of the BEV map.


Figure 1: Sample of Bird’s-Eye-View for Lyft and KITTI dataset. Both BEV maps are on
information level Position.


Sallab et al. [11] and Saleh et al. [12] proposed frameworks to translate
synthetic BEV LiDAR maps to real BEV LiDAR maps. The simulated point
clouds were generated with the CARLA simulator [13]. Without additional
post-processing, the simulated point clouds are too smooth and lack artifacts
that are common in real LiDAR point clouds. Both approaches used KITTI as the
real-world dataset. Sallab et al. did not state which LiDAR setup their
simulated LiDAR used. They encoded the height information of the 3D point
clouds into the BEV maps, but did not use any additional information about the
intensity or density of the LiDAR points. The simulated BEV maps were
translated with CycleGAN and used together with real BEV maps from KITTI to
train an object detector. Sallab et al. improved the object detector from
65.3 % mAP, when using KITTI point clouds with 100,000 raw simulated point
clouds, to 71.5 %, when using KITTI point clouds and 100,000 translated
simulated point clouds. Saleh et al. used a simulated Velodyne HDL-64E LiDAR
for the simulated point clouds, i.e. the same type of LiDAR as in the KITTI
dataset (Tab. 1). They mapped the 3D point clouds into BEV without any
additional information and used only the location of the points. They trained
an object detector with several dataset combinations and tested it on the same
split of the KITTI dataset. For KITTI alone they obtained an AP of 57.26 %;
this improved to 59.16 % with the addition of 6,000 simulated point clouds and
to 64.29 % with the addition of 6,000 simulated and translated point clouds.
Using only the 6,000 simulated point clouds for training resulted in an AP of
29.93 %, which could also be improved to 34.78 % by translating the point
clouds. However, this also shows that the translation alone could not produce
perfectly realistic data.


We aim to make the BEV maps of two real world LiDAR datasets look more 
similar. For this we will: 


• carry out a translation using CycleGAN between
two real world LiDAR datasets for the first time


• use different information levels of BEV maps for the translation
(Position, Height, Height+Density, Height+Density+Intensity)


• evaluate the quality of this translation with two different strategies.
Usability: test whether the translated target BEV maps work better for an
object detector that is trained on the source. Structure: test whether the
characteristic structure of the Lyft dataset (number of channels, location of
points) is adopted by the translated BEV map.


2 Method 


2.1 Datasets 


We focus on the translation from KITTI to Lyft, because Lyft has a LiDAR setup
that is similar, but not identical, to that of KITTI. Both KITTI and Lyft have
a LiDAR mounted on the roof and generate a 360 degree view of the surrounding
area with multiple laser channels. KITTI has just one top LiDAR, and we also
use only the point clouds of Lyft’s top LiDAR. The circular pattern of
both LiDAR point clouds can be seen in Figure 1. A point cloud from KITTI
has many more points near the ego vehicle than a point cloud from Lyft. This
can be seen in Figure 1, as more white pixels are located around the ego vehicle
of KITTI in the bottom center than in the BEV map of Lyft. Figure 1 also shows
the effect of the different number of LiDAR channels: the LiDAR of Lyft
produces fewer white circles in the BEV map than that of KITTI. This is because
the LiDAR in KITTI has more laser channels scanning the area close to the ego
vehicle, while in Lyft the lasers focus the scan on the mid-range. In the
BEV maps the most significant differences between the two datasets can be
seen in parts that belong to the street, while in both examples of Figure 1 the
cars have a similar L-shape. The number of channels is 64 for KITTI’s and 40
for Lyft’s top LiDAR (Tab. 1). Hence a translation from KITTI to Lyft should
delete some of these circle lines without destroying the L-shape of a vehicle.


2.2 Domain Adaptation 


The problem of training a method on one dataset (source domain) and applying
it to another dataset (target domain) is called domain adaptation. Domain
adaptation is a type of transfer learning and aims to bridge the gap between
different domains of data [14]. A specific type of domain adaptation is
image-to-image translation, in which an input image from the source domain is
transformed into an image that resembles the distribution of the target domain.
The transformation is observable as the source image adopts the "style" of
the target domain. Such a transformation can be studied in two settings:
supervised and unsupervised. In the supervised setting, every target image has
a corresponding source image. In the unsupervised setting, no target image has
a corresponding source image. As Lyft and KITTI were recorded at different
locations and different times, we do not have paired data and therefore work in
the unsupervised setting.


Within the scope of this work, we use the Cycle-Consistent Adversarial
Network (CycleGAN) [8]. From unpaired images of two different domains,
CycleGAN learns how to apply the characteristics of one domain to the images
of the other. CycleGAN consists of two generators and two discriminators:
one generator takes images of the first domain and outputs images of the second
domain, and the second generator does the reverse. The discriminators determine
how plausible the outputs of the generators are for the respective domains. In
addition, CycleGAN applies a cycle-consistency loss. CycleGAN thus combines
adversarial losses with a cycle-consistency loss, which also evaluates the
back-projection of generated data from the target domain into the source
domain.
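

The cycle-consistency term described above can be illustrated with a minimal
sketch. Images are represented as flat lists of pixel values and the two
generators as plain functions, which is a strong simplification: in CycleGAN
they are neural networks, and the adversarial losses are omitted here. The L1
distance follows the CycleGAN paper [8]; all function names are illustrative.

```python
def l1_distance(image_a, image_b):
    """Mean absolute pixel difference between two equally sized images."""
    return sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)


def cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b):
    """L1 cycle loss: A -> B -> A and B -> A -> B should reproduce the input."""
    forward = sum(l1_distance(g_ba(g_ab(a)), a) for a in batch_a) / len(batch_a)
    backward = sum(l1_distance(g_ab(g_ba(b)), b) for b in batch_b) / len(batch_b)
    return forward + backward


# Toy stand-ins for the generators (hypothetical, not learned):
def darken(image):      # plays the role of G: A -> B
    return [0.5 * p for p in image]


def brighten(image):    # plays the role of F: B -> A, exact inverse of darken
    return [2.0 * p for p in image]


batch_a = [[1.0, 0.0, 1.0]]
batch_b = [[0.0, 1.0, 0.0]]
print(cycle_consistency_loss(darken, brighten, batch_a, batch_b))
```

With generators that are exact inverses of each other the loss is zero; any
information lost in one direction shows up as a positive loss.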


Because CycleGAN² is designed to translate 2D data, we convert the 3D
point clouds to 609 x 609 Bird's-Eye-View RGB maps. We can encode different
levels of information into the pixel color values of the BEV. The simplest form
of a BEV map (Position) uses only the location of the points and ignores the
height, such that a pixel is 255 if there is a LiDAR point at this position
and 0 otherwise. See Figure 1 for an example. In the next level of information
(Height), we use the height of the points and encode it into the greyscale
value of the pixel. If more than one point is at the same position, we
use the maximum height. In the third level of information (Height+Density),
we encode the height and the density of points that lie over each other
in the first two color channels of the pixel. In the last level of information
(Height+Density+Intensity), we also encode the intensity, which describes how
well the laser beam is reflected, into the third color channel of the pixel.
If more than one point is mapped to the same pixel, the maximum intensity is
used.
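

The four information levels can be sketched as follows. The metric ranges, the
height scaling and the logarithmic density normalization are assumptions made
for illustration (the text only fixes the 609 x 609 size, the use of the
maximum height and maximum intensity, and the channel assignment), and the
function name `point_cloud_to_bev` is hypothetical.

```python
import math

LEVELS = ("position", "height", "height+density", "height+density+intensity")


def point_cloud_to_bev(points, level="position", size=609,
                       x_range=(0.0, 60.0), y_range=(-30.0, 30.0),
                       z_range=(-2.0, 2.0)):
    """Rasterize (x, y, z, intensity) points, intensity in [0, 1], to RGB pixels."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    cells = {}  # (row, col) -> [maximum height, point count, maximum intensity]
    for x, y, z, intensity in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        if not (z_range[0] <= z <= z_range[1]):
            continue
        row = min(size - 1, int((x - x_range[0]) / (x_range[1] - x_range[0]) * size))
        col = min(size - 1, int((y - y_range[0]) / (y_range[1] - y_range[0]) * size))
        cell = cells.setdefault((row, col), [z, 0, intensity])
        cell[0] = max(cell[0], z)          # maximum height per pixel
        cell[1] += 1                       # number of points per pixel
        cell[2] = max(cell[2], intensity)  # maximum intensity per pixel

    image = [[(0, 0, 0)] * size for _ in range(size)]
    for (row, col), (z_max, count, i_max) in cells.items():
        height = round(255 * (z_max - z_range[0]) / (z_range[1] - z_range[0]))
        # Assumed density normalization: saturates at 63 points per pixel.
        density = round(255 * min(1.0, math.log(count + 1) / math.log(64)))
        if level == "position":
            pixel = (255, 255, 255)        # occupied pixel is white
        elif level == "height":
            pixel = (height, height, height)
        elif level == "height+density":
            pixel = (height, density, 0)
        else:
            pixel = (height, density, round(255 * i_max))
        image[row][col] = pixel
    return image
```

A call such as `point_cloud_to_bev(points, level="height+density")` returns a
nested list of RGB tuples that can be written to an image file with any image
library.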


2.3 Evaluation 
We choose two ways to evaluate how well the point clouds provided by KITTI are
translated into the structure provided by Lyft:


• Usability: How well can the translated BEV maps be used for a task of
autonomous driving when the method is trained on the source dataset?


• Structure: How well do the translated BEV maps represent the
structure of the LiDAR setup of the source dataset?


² Our CycleGAN implementation is based on the implementation of Ming-Yu Liu, which can be
found on GitHub: https://github.com/junyanz/CycleGAN, Jan. 2020.
2.3.1 Usability


As a quantitative measure of how well the translation fulfills the criterion
Usability, we use a strategy similar to [11, 12] and test the translation on
the detector ComplexYOLO [10]. Instead of retraining with additional BEV
maps, we train ComplexYOLO on Lyft and test how the detector performs on
the translated KITTI BEV maps compared to the original KITTI BEV maps.


ComplexYOLO is a real-time 3D object detection network operating on LiDAR
BEV maps. It is based on the YOLO framework with the addition of
a specific Euler-Region-Proposal approach that estimates the orientation of
objects. Determining the orientation of an object is necessary in BEV,
since an object is detected from above instead of from the common front view
and can therefore be rotated.³


To improve the detection of objects, important structures such as the shape of
a car should not be destroyed by the translation, while the characteristics of
the Lyft BEV maps should be adopted by the translated KITTI BEV maps.


As metrics we will use the well known intersection over union (IoU) with a 
threshold of 0.5 and the average precision (AP). 
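

How the IoU threshold decides whether a detection counts as correct can be
sketched as follows. For brevity the sketch uses axis-aligned boxes given as
(x_min, y_min, x_max, y_max); this is a simplifying assumption, since
ComplexYOLO predicts rotated boxes, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted_box, ground_truth_box, threshold=0.5):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return iou(predicted_box, ground_truth_box) >= threshold


# Hypothetical boxes: a prediction that is slightly too short still matches.
print(is_true_positive((0, 0, 2, 2), (0, 0, 2, 3)))
```

The average precision is then obtained from the precision-recall curve that
results from ranking all detections by confidence and marking each one as a
true or false positive in this way.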


2.3.2 Structure 


To measure how well the LiDAR BEV map of KITTI is translated to the
characteristics of Lyft BEV maps, we need a criterion that depends as little as
possible on the geometric properties of a scene and distinguishes between
different LiDAR setups. For this we map the BEV map back to 3D coordinates. In
the case Position we do not have any height information, so the height values
of all points are set to 0. These 3D coordinates are transformed to a spherical
coordinate system, in which every point consists of radial distance, polar
angle and azimuthal angle (r, θ, φ). If we only have one LiDAR without any tilt
and the position of the LiDAR is the origin of the spherical coordinate system,
the angles can be interpreted as the angles of the laser beam during the
scanning of the scene and r as the distance between the object and the LiDAR.
The distributions of θ and φ therefore depend on the LiDAR setup. Because the
azimuthal LiDAR setup in KITTI and Lyft is similar, the distributions of φ
should be similar and differ mostly due to geometric properties of the scene.
Because θ describes the polar angle, it depends on the vertical resolution and
vertical FOV in addition to the scene, so the distribution of θ can be used to
distinguish between the LiDAR setups. The distribution of r depends on both the
scene and the LiDAR setup. Since in KITTI there are more points around the ego
vehicle than in Lyft (see Fig. 1), the distribution of r for KITTI has a higher
peak at low r.


³ Our ComplexYOLO implementation is based on the unofficial implementation by Deepak
Ghimire, PhD, which can be found on GitHub: https://github.com/ghimiredhikura/Complex-YOLOv3,
Jan. 2020.


After the translation of KITTI to Lyft, the distribution of θ should be more
similar to Lyft than to the original KITTI.
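

The reprojection to spherical coordinates can be sketched as follows for the
case Position. The pixel size, the sensor pixel and the mounting height of the
LiDAR above the plane of the points are assumed parameters that the text does
not specify, and both function names are illustrative.

```python
import math


def to_spherical(x, y, z):
    """Cartesian (x, y, z) -> (r, theta, phi); theta is the polar angle from +z."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # clamp against rounding errors
    phi = math.atan2(y, x)                          # azimuthal angle
    return r, theta, phi


def position_bev_to_spherical(occupied_pixels, metres_per_pixel,
                              sensor_pixel, sensor_height):
    """Map occupied (row, col) pixels to spherical coordinates around the sensor."""
    coordinates = []
    for row, col in occupied_pixels:
        x = (row - sensor_pixel[0]) * metres_per_pixel
        y = (col - sensor_pixel[1]) * metres_per_pixel
        z = -sensor_height  # all points at height 0, sensor mounted above them
        if x == 0 and y == 0 and z == 0:
            continue        # the origin itself has no defined angles
        coordinates.append(to_spherical(x, y, z))
    return coordinates


# Hypothetical pixel 3 m in front of a sensor mounted 4 m high: r = 5 m.
print(position_bev_to_spherical([(3, 0)], 1.0, (0, 0), 4.0))
```

Collecting the r, θ and φ values of all occupied pixels yields the three
distributions that are compared between the datasets.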


As a metric to measure the similarity of two distributions we use the
well-known Wasserstein distance, also known as earth mover’s distance [15, 16].
The Wasserstein distance originates from transportation theory, and for
histograms it can be intuitively interpreted as the cost of transforming one
histogram into the other.
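

For one-dimensional samples of equal size, the first Wasserstein distance
reduces to the mean absolute difference between the sorted values, which makes
the "transport cost" interpretation easy to see. The sketch below covers only
this special case and is not necessarily how the distances in this paper were
computed; for samples of different size or weighted histograms, a library
routine such as `scipy.stats.wasserstein_distance` can be used instead.

```python
def wasserstein_1d(samples_u, samples_v):
    """First Wasserstein distance between two equally sized 1D samples."""
    if len(samples_u) != len(samples_v):
        raise ValueError("this sketch only handles samples of equal size")
    pairs = zip(sorted(samples_u), sorted(samples_v))
    return sum(abs(u - v) for u, v in pairs) / len(samples_u)


# Shifting every value by 5 costs exactly 5 per unit of mass.
print(wasserstein_1d([0, 1, 3], [5, 6, 8]))
```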


3 Results 


We train CycleGAN on a split of 7000 BEV maps for KITTI and 7000 BEV
maps for Lyft. The BEV maps are created for the four levels of information
Position, Height, Height+Density and Height+Density+Intensity.


Figure 2 shows qualitative results of the translation for Position. In the
translated KITTI BEV maps the number of white pixels close to the ego vehicle
is reduced, as in Lyft BEV maps. Some circle lines in the translated BEV maps
are also deleted, so that the number of circle lines matches that of Lyft.
Furthermore, the shapes of most cars are preserved, but for some cars the
number of white pixels decreases. Visually, these examples show promising
results for the translation of the characteristics of the LiDAR BEV maps.


Figure 2: Qualitative results for the translation from KITTI to Lyft, for the information level
Position.


For quantitative results we use the described evaluation strategies Usability 
and Structure (Sec. 2.3). 


For Usability, ComplexYOLO was trained to detect cars on a Lyft split of 7000
BEV maps for each of the four BEV information levels. Table 2 shows
the average precision of the detector on test splits from Lyft, KITTI and the
same KITTI scenes after translation. For real Lyft BEV maps the detector
achieves a good average precision of 0.95 or more. For KITTI BEV maps it drops
to an average precision between 0.53 and 0.6. After translation the average
precision of the object detector gets slightly worse, probably because some
shapes of cars were destroyed. This happened, for example, when translating the
BEV map (c) in Figure 2. In the original KITTI BEV map the two cars in Figure 2
(c) are clearly visible, whereas in the corresponding translated map most of
the car is missing and could not be detected. Another potential problem is that
after the translation pixels with small color values occur around the LiDAR
points. These low color artifacts are barely visible to a human. In the first
translated map of Figure 2, 25454 pixels have a color value between 1 and 10
and 25415 pixels have a color value greater than 10. The number of pixels
containing small values decreases if we encode more information in the color
channels, as in Height, Height+Density or Height+Density+Intensity, but the
performance of the detector does not increase, as Tab. 2 shows. Also, if we set
all pixels with a color value smaller than 50 to 0, the AP grows only slightly,
to 0.53, in the case Position. So the influence of these small values can be
neglected.


Table 2: Results of the evaluation Usability. Average precision (IoU=0.5) for the class car of an
object detector that was trained on Lyft Level5 dataset and tested on Lyft Level5, KITTI
and KITTI2Lyft translated with CycleGAN.
P = Position; H = Height; D = Density; I = Intensity.


             P      H      H+D    H+D+I
Lyft         0.95   0.97   0.97   0.97
KITTI        0.53   0.6    0.54   0.53
KITTI2Lyft   0.52   0.54   0.51   0.47


Table 3: Results of the evaluation Structure. Mean Wasserstein distance between the
distributions of the spherical coordinates from the datasets.
P = Position; H = Height; D = Density; I = Intensity.
K = KITTI; L = Lyft; K2L = KITTI2Lyft.


                 P       H       H+D     H+D+I
θ  K vs. L       0.082   0.2     0.2     0.12
θ  K2L vs. K     0.12    0.17    0.17    0.17
θ  K2L vs. L     0.06    0.096   0.096   0.093

φ  K vs. L       0.18    0.18    0.18    0.18
φ  K2L vs. K     0.20    0.21    0.21    0.20
φ  K2L vs. L     0.20    0.21    0.21    0.20

r  K vs. L       4.44    5.35    5.35    5.35
r  K2L vs. K     7.18    5.60    6.41    5.25
r  K2L vs. L     4.07    3.60    3.88    3.42


For the criterion Structure we translate the coordinate system such that the
LiDAR is approximately at the origin and transform the coordinates to spherical
coordinates. We compare the distributions of 100 KITTI BEV maps, the
corresponding translated BEV maps and Lyft BEV maps. The BEV maps are
reprojected and transformed to a spherical coordinate system, and we
calculate the mean Wasserstein distance between the distributions of the
spherical coordinate values (Tab. 3). According to Table 3, the
distribution of r for the translated KITTI BEV maps is always more similar to
Lyft than to KITTI, and also closer to Lyft than Lyft is to KITTI. The
distribution of φ does not change significantly after the translation.
Interestingly, adding the density and intensity information has only little
influence on the results.
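
As a rough illustration of this evaluation, the following sketch converts Cartesian points to spherical coordinates and computes a one-dimensional Wasserstein distance between two equally sized samples. The angle convention (polar angle measured from the z-axis) and all function names are assumptions; the original evaluation code is not shown in the paper. For samples of different sizes, a library routine such as `scipy.stats.wasserstein_distance` could be used instead.

```python
import math


def to_spherical(x, y, z):
    """Convert a Cartesian point to (r, theta, phi).

    Assumed convention: theta is the polar angle from the z-axis,
    phi is the azimuth in the x-y plane.
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0
    phi = math.atan2(y, x)
    return r, theta, phi


def wasserstein_1d(sample_a, sample_b):
    """1-Wasserstein distance between two empirical samples of equal size.

    For equally weighted samples of the same size this is the mean absolute
    difference between the sorted values.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(u - v) for u, v in pairs) / len(sample_a)


# Toy point clouds (made-up values): compare the distributions of r.
cloud_a = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
cloud_b = [(2.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 4.0)]
r_a = [to_spherical(*p)[0] for p in cloud_a]
r_b = [to_spherical(*p)[0] for p in cloud_b]
print(wasserstein_1d(r_a, r_b))
```

Averaging this distance over all compared map pairs would give a mean distance per coordinate, in the spirit of Table 3.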


4 Conclusions and Outlook 


In this paper we analyzed the possibility of unsupervised domain adaptation
between two LiDAR datasets with different LiDAR setups. The demonstrated
method of translating LiDAR BEV maps from KITTI to the structure of the
Lyft LiDAR setup using CycleGAN showed visually promising results (Fig. 2).
With our evaluation strategy Structure we could show that the translated KITTI
point clouds indeed adopt the LiDAR setup characteristics of Lyft. Nevertheless,
Usability showed that the performance of our tested car detector could not
be improved and even got slightly worse, because some car shapes were
damaged during the translation process. This confirms the assumption of Wang
et al. [7] that the size of cars plays a more important role for the performance
of object detectors than the setup of the LiDAR. To improve the performance,
one has to better preserve the shapes that are relevant for the task,
e.g. the L-shape of cars. But without additional information for enlarging or
shrinking the size of cars, the performance of an object detector will likely
not get much better. Further study is needed to analyze the influence of
different LiDAR setups on object detection. The CARLA [13] simulation
environment can be used for this, as CARLA lets us choose the number
of channels and thus record the same scene with different LiDAR setups.
The object detector can then be tested on the same scenes while only the LiDAR
setup changes; if the setup has no influence, the object detector should
perform roughly equally well whether the LiDAR setup matches the one used in
training or not.


Besides the translation between two datasets with different
LiDAR setups, the method could also be used to transfer between different
types of sensors. With Astyx HiRes, a dataset with high-resolution RADAR data
has been published. But since this is the only existing high-resolution RADAR
dataset at the moment and it contains only a few scenes, training methods such
as object detectors on it is challenging. With the demonstrated framework of
translating BEV maps, one could try to translate between the RADAR and LiDAR
sensors. Because Astyx also provides LiDAR point clouds in addition to RADAR
point clouds, one could also use supervised domain adaptation methods.


Abstract


With the share of renewable energy sources in the energy system increasing, 
accurate wind power forecasts are required to ensure a balanced supply and 
demand. Wind power is, however, highly dependent on the chaotic weather 
system and other stochastic features. Therefore, probabilistic wind power 
forecasts are essential to capture uncertainty in the model parameters and input 
features. The weather and wind power forecasts are generally post-processed 
to eliminate some of the systematic biases in the model and calibrate it to 
past observations. While this is successfully done for wind power forecasts, 
the approaches used often ignore the inherent correlations among the weather 
variables. The present paper, therefore, extends the previous post-processing
strategies by including Ensemble Copula Coupling (ECC) to restore the
dependency structures between variables and investigates whether including the
dependency structures changes the optimal post-processing strategy. We find
that the optimal post-processing strategy does not change when including ECC
and that ECC does not improve the forecast accuracy when the dependency
structures are weak. We, therefore, suggest investigating the dependency
structures before choosing a post-processing strategy.


1 Introduction


As the share of renewable energy sources in the energy system increases, wind 
power forecasts become essential to guarantee balanced supply and demand. 
However, wind power highly depends on the chaotic weather system as well 
as other stochastic features and thus modelling uncertainty in these forecasts 
is important [1]. Probabilistic wind power forecasts aim to capture the
uncertainty inherent in the model parameters and input features. Capturing this
uncertainty is not easy and weather predictions, for example, are known to be
biased and underdispersed. Meteorologists have therefore been post-processing
ensemble weather predictions to describe the uncertainty more accurately
[2, 3, 4, 5, 6, 7, 8, 9, 10]. In this post-processing the ensemble weather
predictions are calibrated to the actual historical weather observations, eliminating
some of the model biases. Transferring this approach to wind power forecasts 
yields promising results for handling the uncertainty in wind power forecasting 
models with uncertain weather features and forecast horizons of 3h-24h [11]. 


Phipps et al. [11] show that post-processing only the resulting wind power 
forecast is the best strategy with respect to forecast accuracy. However, their 
approach ignores dependencies between the weather variables as the post- 
processing is done using Ensemble Model Output Statistics (EMOS). This 
method involves fitting parametric distributions to the ensemble forecasts and 
sampling from them to generate post-processed forecasts. Random sampling 
causes inherent correlations, so-called dependency structures, between varia- 
bles to be lost, which could affect the forecast performance. The present paper, 
therefore, extends the previous post-processing strategies by including Ensem- 
ble Copula Coupling (ECC) to restore the dependency structures between the 
variables and investigates whether including the dependency structures chan- 
ges the optimal post-processing strategy. In contrast to other approaches, we do 
not investigate the temporal [12] or spatiotemporal [13] dependencies. We also 
use a parametric approach unlike the non-parametric methods used in [14]. 


The extended post-processing strategy is evaluated on two data sets with both 
a linear regression and an artificial neural network used as forecasting models. 
Both the ability of ECC to restore the dependency structures between the
variables and the resulting forecast accuracy are considered as performance
measures.


The remainder of the present paper is structured as follows. Firstly we in- 
troduce theoretical concepts including ECC in Section 2. We then discuss 
the altered post-processing strategies (Section 3) before evaluating the effect 
of ECC in Section 4. We discuss the results in Section 5 before Section 6 
concludes. 


2 Background 


Before evaluating the effect of ECC on wind power ensemble post-processing, 
we introduce EMOS, ECC, and also discuss the forecasting models we use. 


2.1 Ensemble Post-Processing 


The weather ensembles from the Ensemble Prediction System (EPS) are known 
to be biased and underdispersed, and thus need to be calibrated [2]. Not
accounting for this bias or underestimated variance could lead to erroneous wind
power forecasts, which affect the stability of the energy system. The present
paper applies EMOS developed by Gneiting et al. [6] to perform this
calibration. EMOS is based on non-homogeneous regression and is performed for
each weather variable individually given a single origin and set forecast horizon.


The standard EMOS approach is designed for ensemble members that are 
individually distinguishable. Ensemble members from the European Centre 
for Medium-Range Weather Forecasts (ECMWF) are, however, classified as 
exchangeable, thus representing equally likely future scenarios without
distinguishing features [15, 16]. Given exchangeable ensemble members
$x_1, \ldots, x_M$, we apply EMOS from Gneiting and Katzfuss [5], where the
weather variable $y$ with an assumed normal distribution is modelled as


$$y \mid x_1, \ldots, x_M \sim \mathcal{N}\left(a + b \sum_{m=1}^{M} x_m,\; c + d S^2\right), \qquad (1)$$


with $a$, $b$, $c$ and $d$ being regression coefficients, and the predictive
variance being a linear function of the ensemble spread $S^2$, with


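$$S^2 = \frac{1}{M-1} \sum_{m=1}^{M} \left(x_m - \bar{x}\right)^2, \qquad \bar{x} = \frac{1}{M} \sum_{m=1}^{M} x_m. \qquad (2)$$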


We also use the same approach with a truncated normal distribution. Since we
post-process each weather variable for a single origin and set forecast horizon,
the EMOS coefficients change when any one of these three parameters is
altered. We apply EMOS using a rolling calibration window of 40 days with
the help of the scoringRules¹ package. For more information on EMOS and
its application in meteorology see Gneiting et al. [6] or Gneiting [17].
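

To make the EMOS predictive distribution concrete, the following minimal sketch evaluates the normal model for one ensemble once the four coefficients are known. It is written in Python rather than R, assumes that the ensemble spread is the sample variance, and uses made-up coefficient values; the estimation of the coefficients over the rolling 40-day window (done with scoringRules in the paper) is not shown.

```python
from statistics import NormalDist, variance


def emos_predictive(ensemble, a, b, c, d):
    """Return the normal predictive distribution N(a + b * sum(x), c + d * S^2).

    Assumption: S^2 is the sample variance of the ensemble members.
    """
    spread = variance(ensemble)
    predictive_mean = a + b * sum(ensemble)
    predictive_variance = c + d * spread
    if predictive_variance <= 0:
        raise ValueError("coefficients must yield a positive predictive variance")
    return NormalDist(mu=predictive_mean, sigma=predictive_variance ** 0.5)


# Hypothetical coefficients and a toy three-member ensemble.
forecast = emos_predictive([1.0, 2.0, 3.0], a=1.0, b=0.5, c=0.5, d=2.0)
print(forecast.mean, forecast.variance)
```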


2.2 Ensemble Copula Coupling 


With the EMOS method introduced above, we now have ensembles that are 
calibrated to the past data. All weather variables are modelled with a univariate 
distribution. We could sample from each of these distributions independently 
and input the information into the wind power forecasting scheme. However, 
the original Numerical Weather Prediction (NWP) includes information on
the weather variables' dependencies among each other as well as in space and
time. This information gets lost with the univariate EMOS approach. In order
to retain these dependencies, several empirical copula-based approaches have 
been developed, which we want to introduce in more detail in this section. 


A d-dimensional copula is a multivariate cumulative distribution function on
the unit cube $[0,1]^d$ with uniform margins [18, 19]. The importance of copulas
in restoring dependency structures is based on the theorem of Sklar [20]. It
states that for any multivariate cumulative distribution function $F$ with
margins $F_1, \ldots, F_M$ there exists a copula $C$ that is unique on the range
of the margins and has the form


$$F(x_1, \ldots, x_M) = C\left(F_1(x_1), \ldots, F_M(x_M)\right), \qquad (3)$$


for $x_1, \ldots, x_M \in \mathbb{R}$. With regard to ensemble calibration, we
already have the univariate margins $F_1, \ldots, F_M$ in the form of the EMOS
univariate distributions. Therefore, Sklar’s theorem states that as long as an
appropriate copula is defined, univariate ensemble post-processing techniques
can be used to accommodate any dependency structure.


1 https://cran.r-project.org/web/packages/scoringRules/scoringRules.pdf


The ECC approach is based on the mathematical concepts defined above
with the appropriate copula in the form of a reordering process. The idea is 
that given a dependence structure “template” [21], the samples drawn from 
the univariate EMOS distributions can be reordered in such a way that they 
resemble the initial correlation structures that were given in the NWP. The 
templates are based on the raw ensembles, where we assume that the ensembles 
capture the correlations. While several variants exist, we take a closer look 
at random ECC (ECC-R), quantile ECC (ECC-Q), and transformation-based 
ECC (ECC-T). These methods differ from each other in the way they sample 
from the distributions and whether they include a reordering step. Firstly, we 
take a look at an ECC approach based on random draws. Here, we sample with
the independent standard uniform random variates $u_1, \ldots, u_M$, such that


$$\tilde{x}_1 = F^{-1}(u_1), \ldots, \tilde{x}_M = F^{-1}(u_M). \qquad (4)$$


In this ECC-R approach, the samples have to be reordered according to the
template given by the raw ensembles. This template is based on the rank
structure. For each time horizon, location and variable, and following the
notation of Schefzik et al. [21], the raw ensemble members $x_1, \ldots, x_M$
and their order statistics $x_{(1)} \le \cdots \le x_{(M)}$ define a ranking
permutation $\pi$ with $\pi(m) := \operatorname{rank}(x_m)$ for
$m \in \{1, \ldots, M\}$. After drawing the samples from the distribution, they
are reordered according to $\pi$. The ECC approach based on quantiles involves
the same reordering step as the ECC-R approach; the sampling method, however,
is different. In the case of ECC-Q, the samples are drawn from equally spaced
quantiles, such that


$$\tilde{x}_1 = F^{-1}\left(\frac{1}{M+1}\right), \ldots, \tilde{x}_M = F^{-1}\left(\frac{M}{M+1}\right), \qquad (5)$$


and then reordered as before. In contrast to the ECC-R and ECC-Q approaches,
ECC-T relies on a transformation and does not require an additional reordering
step. The samples are drawn such that


$$\tilde{x}_1 = F^{-1}(S(x_1)), \ldots, \tilde{x}_M = F^{-1}(S(x_M)), \qquad (6)$$


where $S$ is the fit of a cumulative distribution function to the raw ensembles
[21]. The choice of $S$ depends on the variables in question; in the case of
temperature, pressure and wind component vectors, $S$ can be assumed to be
normal with mean equal to the ensemble mean and variance equal to the ensemble
variance [3].
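

The three ECC variants described above can be sketched for a single weather variable as follows. This is an illustrative Python sketch and not the authors' R implementation: the calibrated EMOS distribution is assumed to be normal, ECC-Q is assumed to use the quantile levels m/(M+1), the normal fit in ECC-T uses the sample standard deviation, and all function names and toy values are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def ranks(values):
    """Return the rank (1 = smallest) of every value; ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def reorder_like(samples, raw):
    """Reorder the samples so that they share the rank order of the raw ensemble."""
    ordered = sorted(samples)
    return [ordered[rank - 1] for rank in ranks(raw)]


def ecc_r(dist, raw, rng):
    """ECC-R: random draws through the inverse CDF, then reordering."""
    draws = []
    while len(draws) < len(raw):
        u = rng.random()
        if 0.0 < u < 1.0:  # inv_cdf is only defined on the open interval (0, 1)
            draws.append(dist.inv_cdf(u))
    return reorder_like(draws, raw)


def ecc_q(dist, raw):
    """ECC-Q: equally spaced quantiles (assumed levels m / (M + 1)), then reordering."""
    m = len(raw)
    quantiles = [dist.inv_cdf(i / (m + 1)) for i in range(1, m + 1)]
    return reorder_like(quantiles, raw)


def ecc_t(dist, raw):
    """ECC-T: map every raw member through a normal fit S and the inverse CDF."""
    fitted = NormalDist(mean(raw), stdev(raw))
    return [dist.inv_cdf(fitted.cdf(x)) for x in raw]


# Toy raw ensemble and a hypothetical calibrated EMOS distribution.
raw_ensemble = [3.0, 1.0, 2.0, 5.0, 4.0]
calibrated = NormalDist(mu=10.0, sigma=2.0)
rng = random.Random(0)
sample_r = ecc_r(calibrated, raw_ensemble, rng)
sample_q = ecc_q(calibrated, raw_ensemble)
sample_t = ecc_t(calibrated, raw_ensemble)
print(sample_r, sample_q, sample_t)
```

Applying the same function to every weather variable with its own raw ensemble gives each post-processed variable the rank order of its raw counterpart, which is how the joint rank structure across variables is carried over.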


2.3 Forecasting Models 


The present paper focuses on the effect of ECC on optimal post-processing 
strategies and not developing state-of-the-art wind power forecasts. Therefore, 
we use the same forecasting models (a linear regression model and a neural 
network), and the same forecasting strategy as in [11]. Both are described in 
the following together with how we measure the forecasting accuracy. 


2.3.1 Linear Regression 


The simplest models, which we use to forecast wind power, are linear regres- 
sion models. These linear regression models can be described with 


$$y_{t+h} = \beta_0 + \alpha\, y_{t+h-24} + \sum_{k=1}^{K} \beta_k W^k_{t+h} + \sum_{j=1}^{J} \gamma_j D^j_{t+h} + \varepsilon_{t+h}, \qquad (7)$$

where $y_t$ is the dependent variable, which in this case is the wind power,
$y_{t-24}$ is the actual wind power a day before, $W^k$ are weather time series,
such as wind speed and temperature, and $D^j$ are dummy variables, such as the
season, the month and the year. The models are fitted for each forecast
horizon $h$ with $h \le 24$ using actual historical weather data in order to
describe the real relationship among the variables and to remove any bias that
fitting on historical weather forecasts or ensembles could introduce. Each
ensemble member $x_1, \ldots, x_M$ from the EPS is then used in a separate
prediction run for each forecast horizon to generate an ensemble of wind power
predictions with the previously fitted regression coefficients


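$$\hat{y}_{t+h}(x_1, \ldots, x_M) = \beta_0 + \alpha\, y_{t+h-24} + \sum_{k=1}^{K} \beta_k \hat{W}^k_{t,h}(x_1, \ldots, x_M) + \sum_{j=1}^{J} \gamma_j D^j_{t+h} + \varepsilon, \qquad (8)$$


where $\hat{W}^k_{t,h}$ is the weather forecast made at time $t$ for the forecast horizon $h$.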
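

The way an ensemble of wind power predictions arises from already fitted regression coefficients can be sketched as follows. The coefficient names, the dictionary layout and all numbers are hypothetical; the sketch returns point predictions per ensemble member and leaves out the error term.

```python
def predict_power(coefficients, lagged_power, weather, dummies):
    """Evaluate the fitted linear model for one set of weather inputs."""
    weather_part = sum(b * w for b, w in zip(coefficients["weather"], weather))
    dummy_part = sum(g * d for g, d in zip(coefficients["dummies"], dummies))
    return (coefficients["intercept"]
            + coefficients["lag"] * lagged_power
            + weather_part
            + dummy_part)


def ensemble_power_forecast(coefficients, lagged_power, weather_members, dummies):
    """Run one prediction per weather ensemble member."""
    return [predict_power(coefficients, lagged_power, member, dummies)
            for member in weather_members]


# Made-up coefficients for two weather variables and one dummy variable.
coefficients = {"intercept": 1.0, "lag": 0.5, "weather": [2.0, -1.0], "dummies": [3.0]}
members = [[4.0, 1.0], [5.0, 2.0]]  # two ensemble members, two weather variables each
power_ensemble = ensemble_power_forecast(coefficients, 10.0, members, [1.0])
print(power_ensemble)
```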


2.3.2 Neural Network 


In order to better forecast non-linear dependencies, we implement a neural 
network. We tested multiple neural network configurations before selecting 
a configuration with two hidden layers of 10 and 7 neurons respectively and 
trained it with the resilient backpropagation algorithm. This network
architecture was selected because it is the simplest we found that still returns
accurate forecasts. The chosen activation function is a hyperbolic tangent given by


$$\sigma(x) = \frac{e^{2x} - 1}{e^{2x} + 1}, \qquad \sigma(x) \in [-1, 1]. \qquad (9)$$

The input features remain the same as for the linear regression model explained
above. Again, the parameters (i.e. weights) are fitted using the actual historical
weather data and each ensemble member is passed through the network to get
an ensemble wind power prediction. The neural networks are implemented in
R with the neuralnet package².
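

For illustration, the forward pass of a network with two hidden layers of 10 and 7 neurons and a hyperbolic tangent activation can be sketched in plain Python. This is not the neuralnet implementation used in the paper: the weights are random placeholders rather than trained values, the number of input features is hypothetical, and a linear output neuron is assumed.

```python
import math
import random


def tanh_from_exponentials(x):
    """Hyperbolic tangent written as (e^(2x) - 1) / (e^(2x) + 1)."""
    return (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)


def init_layer(n_in, n_out, rng):
    """Create placeholder weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(inputs, layers):
    """Propagate inputs through the layers: tanh on hidden layers, linear output."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        pre_activation = [sum(w * a for w, a in zip(row, activations)) + b
                          for row, b in zip(weights, biases)]
        is_output_layer = index == len(layers) - 1
        activations = (pre_activation if is_output_layer
                       else [math.tanh(v) for v in pre_activation])
    return activations


rng = random.Random(42)
n_features = 4  # hypothetical number of input features
network = [init_layer(n_features, 10, rng), init_layer(10, 7, rng), init_layer(7, 1, rng)]
prediction = forward([0.2, -0.1, 0.5, 1.0], network)
print(prediction)
```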


2 https://cran.r-project.org/web/packages/neuralnet/neuralnet.pdf


2.3.3 Forecasting Accuracy


To evaluate the forecasting approaches, we use the Continuous Ranked Proba- 
bility Score (CRPS). This error measure is used to assess the calibration and 
sharpness of the probabilistic forecast and can be described as follows 


$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(z) - \mathbb{1}\{y \le z\} \right)^2 \, dz \qquad (10)$$


with $F$ being the predictive cumulative distribution function of the wind
power generation, $y$ the verifying observation and $\mathbb{1}$ denoting the
indicator function. We report the mean score over all time steps
$t = 1, \ldots, N$ in the test set


$$\overline{\mathrm{CRPS}} = \frac{1}{N} \sum_{t=1}^{N} \mathrm{CRPS}(F_t, y_t). \qquad (11)$$
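

When the forecast is given as an ensemble, the CRPS of its empirical distribution can be computed with the well-known identity CRPS = E|X - y| - 0.5 E|X - X'|. The sketch below uses this identity and then averages over time steps; it is a plain Python illustration with made-up numbers, not the scoringRules routine used for the paper.

```python
def crps_ensemble(members, observation):
    """CRPS of the empirical distribution of an ensemble for one observation."""
    m = len(members)
    mean_abs_error = sum(abs(x - observation) for x in members) / m
    mean_abs_spread = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    return mean_abs_error - 0.5 * mean_abs_spread


def mean_crps(forecasts, observations):
    """Average CRPS over all time steps of a test set."""
    scores = [crps_ensemble(f, y) for f, y in zip(forecasts, observations)]
    return sum(scores) / len(scores)


# Two toy time steps: a three-member ensemble and a single-member forecast.
forecasts = [[1.0, 2.0, 3.0], [2.0]]
observations = [2.0, 5.0]
print(mean_crps(forecasts, observations))
```

For a single-member forecast the score reduces to the absolute error, which makes the CRPS directly comparable with the error of a deterministic forecast.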


3 Post-Processing Strategies 


The present paper focuses on determining whether including ECC into the 
post-processing of wind power forecasts affects the performance of these stra- 
tegies and changes the optimal post-processing strategy. Four post-processing 
strategies were identified by Phipps et al. [11] and these are shown with the 
addition of ECC in Figure 1. We identify two strategies that can be extended 
to include ECC, whilst two remain the same: 


Raw. Using the raw weather ensembles directly in the forecast is not affected 
by ECC. Here we take all available M ensemble members from the EPS for 
multiple weather variables to generate the wind power forecast. Thus, the 
resulting wind power forecast consists of M members. 


One-Step-P. The second strategy identified by Phipps et al. [11] is also un- 
changed by ECC. The output of the wind power forecasting model is post- 
processed, without any previous calibration of the input variables. This as- 
sumes that post-processing the wind power ensembles also accounts for the 
biases in the weather ensembles. For a detailed description of the One-Step-P 
approach see Phipps et al. [11].


One-Step-W. This strategy involves calibrating the raw weather ensembles
before they are used as inputs in the wind power forecast model. We initially
post-process the weather ensembles analogously to Phipps et al. [11]: each weather
variable (temperature, wind speed, wind component vectors, etc.) is considered
separately and post-processed using EMOS with a rolling calibration window.
This results in probability distributions for each weather variable and we form 
new post-processed weather ensembles by sampling from these distributions. 
It is in this resampling stage that we apply ECC. Depending on the method (see 
Section 2.2) we either reorder the EMOS samples or also consider a different 
sampling strategy or transformation. As a result the strategy now has multiple 
variants. We use the resulting post-processed ensembles with restored depen- 
dency structures as inputs into the wind power forecast model. The resulting 
ensemble of wind power forecasts is not processed further. 


Two-Step-WP. The final strategy is also altered when we include ECC. In 
this strategy both one-step post-processing approaches are coupled together. 
We first post-process the weather ensembles using the altered One-Step-W 
strategy including ECC. These ensembles are used to generate a wind power 
ensemble forecast and then we again post-process the result as in the One- 
Step-P method. Since there is only one set of wind power ensembles, and therefore
no dependencies are present, we are unable to apply ECC after the first EMOS
application step.
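

The four strategies differ only in where calibration and coupling are placed relative to the power model, which the following sketch tries to make explicit. All callables (`power_model`, `postprocess_weather`, `couple`, `postprocess_power`) are hypothetical placeholders for the forecasting model, EMOS on the weather variables, an ECC variant and EMOS on the power ensemble; the toy versions below only exist so that the sketch runs.

```python
def raw(weather_ensemble, power_model):
    """Raw: feed every raw weather member into the power model."""
    return [power_model(member) for member in weather_ensemble]


def one_step_p(weather_ensemble, power_model, postprocess_power):
    """One-Step-P: post-process only the resulting power ensemble."""
    return postprocess_power(raw(weather_ensemble, power_model))


def one_step_w(weather_ensemble, power_model, postprocess_weather, couple):
    """One-Step-W: calibrate the weather ensemble, restore dependencies, then forecast."""
    calibrated = postprocess_weather(weather_ensemble)
    coupled = couple(calibrated, weather_ensemble)
    return [power_model(member) for member in coupled]


def two_step_wp(weather_ensemble, power_model, postprocess_weather, couple,
                postprocess_power):
    """Two-Step-WP: One-Step-W followed by post-processing of the power ensemble."""
    power = one_step_w(weather_ensemble, power_model, postprocess_weather, couple)
    return postprocess_power(power)


# Toy stand-ins (made up for illustration only).
def toy_power_model(member):
    return 2.0 * member["wind_speed"]


def toy_weather_emos(ensemble):
    return [{"wind_speed": member["wind_speed"] + 1.0} for member in ensemble]


def toy_couple(calibrated, template):
    return calibrated  # passing the members through corresponds to the EMOS-only variant


def toy_power_emos(power):
    centre = sum(power) / len(power)
    return [centre + 1.5 * (p - centre) for p in power]


ensemble = [{"wind_speed": 4.0}, {"wind_speed": 6.0}]
print(raw(ensemble, toy_power_model))
print(two_step_wp(ensemble, toy_power_model, toy_weather_emos, toy_couple, toy_power_emos))
```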


In the present paper we compare these approaches, focusing on discovering if 
the inclusion of ECC in One-Step-W and Two-Step-WP improves the forecast 
performance. Specifically through an evaluation on two data sets, we inves- 
tigate if ECC leads to one of these two strategies outperforming One-Step-P, 
which was previously found to perform best [11]. 


4 Evaluation 


We use two different data sets to evaluate the effect of ECC on the performance 
of the post-processing strategies described above. In this section we briefly 
introduce those data sets and present the results of our evaluation.


Figure 1: An overview of the post-processing strategies compared. Whilst no post-processing (Raw) and post-processing only the power ensembles
(One-Step-P) are unaffected by adding ECC, we extend the other two strategies. We add ECC after the EMOS stage when we
post-process only the weather ensembles (One-Step-W). We also include ECC after the EMOS stage for the weather variables but before the
post-processing of the power ensembles in the Two-Step-WP strategy.


4.1 Data


We evaluate the post-processing strategies on two data sets: A benchmark 
data set including both an onshore and offshore wind park, and real data from 
bidding zones 3 and 4 in Sweden. This section briefly introduces the data 
used. 


4.1.1 Benchmark Data 


The benchmark data set consists of simulated wind power data based on real
wind parks located in Germany. This data was simulated using the
Renewables.ninja³ API, with the input data being selected to mimic real onshore
and offshore wind parks in Germany as closely as possible. Staffell and
Pfenninger [22] verify that the simulation and bias-corrections implemented in
the Renewables.ninja API are capable of reproducing accurate wind power time series. A
detailed description of the parameters used in the simulation is provided by 
Phipps et al. [11]. 


We access open source weather ensemble data through The International Grand
Global Ensemble (TIGGE) archive⁴. The TIGGE archive is a result of The
Observing System Research and Predictability Experiment, which aimed to
combine ensemble forecasts from leading forecast centres to improve
probabilistic forecasting capabilities [23]. Due to damaged tapes in TIGGE we only
use data from February 2017 until August 2018, including the parameters
two-meter temperature, surface pressure, 10m-U-Component of wind,
10m-V-Component of wind and wind speed. Weather data is downloaded for the
same location as the synthetic wind park generated through Renewables.ninja.
We use the ERA5 reanalysis data as the ground truth historical weather data
[24]. We download the identical weather parameters for the same timespan
via the Copernicus Climate Data Store (CDS) API⁵. Data from 2017 is used
for training the forecast models and data from 01.2018 to 08.2018 for
evaluation. For more detailed information on the benchmark data set, including
information on how to replicate it, see Phipps et al. [11].


3 www.renewables.ninja
4 https://apps.ecmwf.int/datasets/data/tigge/
5 https://cds.climate.copernicus.eu/home


4.1.2 Swedish Data 


The Swedish electricity system is divided into four sub-areas or bidding zones 
with the present paper focusing only on the area contained in bidding zones 3 
and 4. We download the wind power generation data aggregated on a bidding 
zone level through the open-source transparency platform which is operated by 
the European Network of Transmission System Operators (ENTSO-E) [25]. 
This data is available at an hourly resolution, but due to limitations in the
weather data we can only forecast every 3h. The weather data for bidding zones
3 and 4 is made up of the ECMWF EPS (Molteni et al. [15]) and the ERA5
reanalysis data (C3S [24]). The ERA5 data again serves as the ground truth
for post-processing, whilst we use the EPS for the ensemble forecasts. We
download the parameters two-meter temperature, surface pressure, 100m-U- 
Component of wind, 100m-V-Component of wind and wind speed. Since the 
weather data only comes in a grid-based format and we perform forecasts for an 
entire bidding zone, this weather data must be aggregated. We use a weighted 
average method to perform this aggregation. We use data from 2015-2017 
for training our forecast models and from 01.2018-08.2019 for evaluation. A 
detailed description of this data set, including the weighted average aggregation 
method is provided by Phipps et al. [11]. 
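

The aggregation step can be pictured as a weighted average over the grid cells of a bidding zone. The sketch below only shows the arithmetic; the actual weights are those described by Phipps et al. [11], and the values used here are made up.

```python
def weighted_average(values, weights):
    """Aggregate grid-cell values into one value using non-negative weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Two toy grid cells with hypothetical weights 1 and 3.
zone_wind_speed = weighted_average([10.0, 20.0], [1.0, 3.0])
print(zone_wind_speed)
```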


4.2 Results 


We evaluate both the ability of ECC to restore the dependency structures and 
the effect this has on forecast performance. This section presents the results of 
the analysis for both data sets introduced above. When analysing performance 
of ECC with scatter plots we only consider April 14, 2018 at 6:00am. The 
results for all other dates and forecast horizons are similar and therefore not
discussed in detail. 


98 Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020 


(a) Onshore Benchmark (b) Offshore Benchmark 


Figure 2: Scatter plots showing the raw dependency structures for both benchmark data sets on 
April 14, 2018 at 6am. Pressure (P), the U-component of wind (U-W), the V-component 
of wind (V-W), temperature (T) and wind speed (WS) are shown. Despite correlation 
between wind components and wind speed, the dependencies between the weather 
variables are not particularly strong. 


4.2.1 Benchmark Data 


First we consider the effect of ECC on restoring the dependency structures 
in the data. We see the dependencies between various weather variables in the 
form of scatter plots in Figure 2. These plots show a histogram of the empirical 
distribution of individual weather ensembles along the diagonal and scatter 
plots of their dependencies in the other positions. In the onshore benchmark 
there is a clear correlation between the V-component of wind and wind speed, 
and also a slight correlation between wind speed and the U-component of wind. 
There are no strong dependencies shown between any other weather variables. 
For the offshore benchmark data the only noticeable correlation is between the 
U-component of wind and wind speed. 


(c) ECC-Q (d) ECC-T


Figure 3: Scatter plots showing the dependency structures with various post-processing strategies
that aim to restore the raw dependency structure of the onshore benchmark on April 14,
2018 at 6am. Pressure (P), the U-component of wind (U-W), the V-component of wind
(V-W), temperature (T) and wind speed (WS) are shown. We see that EMOS destroys all
dependencies and ECC-R is not effective in restoring the structures. ECC-Q and ECC-T
both restore the dependency structures effectively.


Figures 3 and 4 show how these dependency structures change after applying
ensemble post-processing and various ECC methods. Figure 3 details the
onshore benchmark and we see that only applying EMOS removes all
dependency structures. ECC-R is also not effective, with no dependencies being
visible. Both the ECC-Q and ECC-T methods show improvement. The
quantile-sampling-based ECC-Q leads to an almost symmetric marginal distribution,
but also accurately recreates the dependencies between the weather variables.
Since ECC-T is based on a transformation, it is not surprising that this method
recreates the dependencies with the most accuracy. The results are similar for
the offshore benchmark in Figure 4, with the exception of the ECC-Q method.
Although we see some dependency structures being recreated, ECC-Q is not
as effective here as in the onshore data set.
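

The paper assesses the dependency structures visually with scatter plots. As a possible numerical complement, which is an assumption of this sketch and not part of the original evaluation, one could compare the Spearman rank correlation between two variables before and after post-processing. The values below are made up and ties are not handled.

```python
import math


def ranks(values):
    """Return the rank (1 = smallest) of every value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Toy ensemble values for two variables (illustration only).
wind_speed = [3.1, 4.8, 2.2, 6.0, 5.1]
v_component = [1.0, 2.9, 0.4, 4.2, 2.5]
print(spearman(wind_speed, v_component))
```

Because the reordering in ECC-R and ECC-Q copies the rank order of the raw ensemble for every variable, this rank correlation is the same for the reordered samples as for the raw ensemble when there are no ties.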


(c) ECC-Q (d) ECC-T


Figure 4: Scatter plots showing the dependency structures with various post-processing strategies
that aim to restore the raw dependency structure of the offshore benchmark on April 14,
2018 at 6am. Pressure (P), the U-component of wind (U-W), the V-component of wind
(V-W), temperature (T) and wind speed (WS) are shown. We see that EMOS destroys
all dependencies and ECC-R is not effective in restoring the structures. ECC-Q shows
some improvements, but ECC-T restores the dependency structures the best.


In order to assess the effect of ECC on forecast accuracy we consider plots of
the mean CRPS for each forecast horizon on each benchmark data set. Figure
5 compares the mean CRPS of One-Step-P against variants of One-Step-W.
Figure 6 also plots the mean CRPS, but this time for variations of
Two-Step-WP against One-Step-P. The One-Step-P strategy is almost always
slightly more accurate than the variations of One-Step-W and very similar to
the Two-Step-WP variations. We also see that there is almost no difference
between the variations of One-Step-W and Two-Step-WP based on different
ECC methods. The neural networks perform worse than the linear models
on the benchmark data. This is mainly due to a lack of training data but
also suggests that a linear model is more than capable of delivering accurate
forecasts.


Figure 5: Plot comparing the average CRPS score for the test data on the benchmark data set for
each forecast horizon using the One-Step-W variants to the One-Step-P strategy. In (A)
we see the linear model for the onshore data and (B) the neural network. (C) and (D)
show the evaluation of the linear model and neural network on the offshore data set.


Figure 6: Plot comparing the average CRPS score for the test data on the benchmark data set for
each forecast horizon using the Two-Step-WP variants to the One-Step-P strategy. In (A)
we see the linear model for the onshore data and (B) the neural network. (C) and (D)
show the evaluation of the linear model and neural network on the offshore data set.


4.2.2 Swedish Data 


The effect of ECC on the dependency structures in the Swedish data is similar 
to the benchmark data set and therefore we do not discuss it here in detail. The 
mean CRPS values for bidding zone 3 are shown in Table 1 and for bidding 
zone 4 in Table 2. Again all One-Step-W variations deliver similar results and 
perform noticeably worse than the One-Step-P strategy. The Two-Step-WP 
performs similarly to the One-Step-P method, but again applying ECC has 
little effect. On the whole, we see that, as with the benchmark data set, ECC has
almost no effect on the forecast accuracy, with all ECC variants performing
similarly to post-processing strategies where only EMOS is used. In the case
of the Swedish data we also note that the neural network performs slightly
better than the linear model in bidding zone 3.


Table 1: Summary of mean CRPS for the test data in bidding zone 3 in Sweden. The best
prediction for each strategy and each forecast model is highlighted in bold.


Data Set        Model   Strategy        6h      12h     18h     24h

Bidding Zone 3  Linear  Raw             97.12   84.86   91.97   94.28
                Linear  One-Step-P      64.12   68.90   64.62   69.27
                Linear  One-Step-W-E    107.95  95.28   103.85  100.43
                Linear  One-Step-W-R    107.95  95.26   103.21  100.24
                Linear  One-Step-W-Q    108.21  95.48   103.67  100.27
                Linear  One-Step-W-T    107.89  95.43   103.60  100.17
                Linear  Two-Step-WP-E   63.70   67.26   63.28   67.38
                Linear  Two-Step-WP-R   64.50   67.40   63.61   67.02
                Linear  Two-Step-WP-Q   64.55   67.41   63.65   68.10
                Linear  Two-Step-WP-T   63.58   67.78   64.04   68.03
                Neural  Raw             64.35   70.73   72.55   66.99
                Neural  One-Step-P      61.05   63.02   67.13   59.54
                Neural  One-Step-W-E    73.40   64.81   63.26   74.45
                Neural  One-Step-W-R    74.60   65.29   65.34   74.11
                Neural  One-Step-W-Q    74.85   65.52   65.07   75.45
                Neural  One-Step-W-T    74.79   66.18   64.52   74.38
                Neural  Two-Step-WP-E   65.69   59.37   56.79   69.38
                Neural  Two-Step-WP-R   67.23   60.81   57.12   69.65
                Neural  Two-Step-WP-Q   66.27   61.04   57.29   70.87
                Neural  Two-Step-WP-T   67.51   62.23   56.22   70.29


5 Discussion 


The fact that the ECC variants perform differently is not surprising as simi- 
lar results are observed by Schefzik et al. [3]. Across all data sets applying 
EMOS destroys the dependency structures and ECC-R is relatively ineffective 
Table 2: Summary of mean CRPS for the test data in bidding zone 4 Sweden. The best prediction 
for each strategy and each forecast model is highlighted in bold. 


Data Set         Model    Strategy         6h      12h     18h      24h

Bidding Zone 4   Linear   Raw              58.50   67.00   56.51    51.54
                 Linear   One-Step-P       45.13   51.90   50.35    44.26
                 Linear   One-Step-W-E     59.48   60.39   55.97    50.73
                 Linear   One-Step-W-R     59.48   60.78   56.33    51.28
                 Linear   One-Step-W-Q     59.71   61.01   56.35    50.71
                 Linear   One-Step-W-T     59.46   60.80   56.33    50.72
                 Linear   Two-Step-WP-E    45.21   52.69   49.89    43.92
                 Linear   Two-Step-WP-R    45.00   52.75   50.48    43.48
                 Linear   Two-Step-WP-Q    44.54   52.38   50.11    43.41
                 Linear   Two-Step-WP-T    44.43   51.44   50.23    43.34
                 Neural   Raw              52.74   46.70   51.02    43.98
                 Neural   One-Step-P       49.55   47.80   48.11    46.22
                 Neural   One-Step-W-E     82.29   90.97   100.03   82.26
                 Neural   One-Step-W-R     81.34   88.97   98.62    80.29
                 Neural   One-Step-W-Q     78.32   86.29   95.83    78.22
                 Neural   One-Step-W-T     74.79   82.88   89.79    73.70
                 Neural   Two-Step-WP-E    52.06   58.04   63.87    49.42
                 Neural   Two-Step-WP-R    51.79   56.92   62.82    49.39
                 Neural   Two-Step-WP-Q    52.48   57.19   63.47    49.00
                 Neural   Two-Step-WP-T    53.27   56.67   61.52    49.35


in rebuilding dependency structures. Since ECC-Q relies on quantile sampling,
it produces marginal distributions that are always close to symmetric.
With regard to dependency structures, ECC-Q delivers mixed results. For
the onshore benchmark (and also bidding zone 3), it recreates the dependency
structures almost as accurately as ECC-T, but not for the other data sets. This
could be because ECC-Q still relies on sampling from the EMOS distribution,
which we obtain by considering the last 40 days. It is therefore possible
that the EMOS parameters vary in accuracy, which could make the ECC-Q
sampling slightly worse as well. ECC-T, on the other hand, is based on a
monotonic transformation and is therefore expected to always accurately
recreate the dependency structures of the raw ensembles. ECC-T therefore
unsurprisingly performs best in this regard, but the trade-off is the marginal
distributions, which are not as symmetrical as those from ECC-Q. 
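

To make the reordering idea behind ECC-Q concrete, the following is a minimal
sketch for a single weather variable. It assumes a Gaussian post-processed
(EMOS) distribution and equidistant quantile levels k/(m+1) for an ensemble of
m members; the function names and the Gaussian choice are illustrative
assumptions and not taken from the implementation used in this paper.

```python
from statistics import NormalDist


def ecc_reorder(raw_members, calibrated_values):
    """Arrange calibrated values in the rank order of the raw ensemble members."""
    if len(raw_members) != len(calibrated_values):
        raise ValueError("both inputs need one value per ensemble member")
    # Member indices sorted by raw forecast value (ties keep member order).
    members_by_rank = sorted(range(len(raw_members)), key=lambda i: raw_members[i])
    sorted_values = sorted(calibrated_values)
    reordered = [0.0] * len(raw_members)
    for rank, member in enumerate(members_by_rank):
        # The member with the k-th smallest raw value gets the k-th smallest
        # calibrated value, so the raw rank structure is preserved.
        reordered[member] = sorted_values[rank]
    return reordered


def ecc_q(raw_members, mean, std):
    """ECC-Q for one variable, ASSUMING a Gaussian post-processed distribution."""
    m = len(raw_members)
    dist = NormalDist(mean, std)
    quantiles = [dist.inv_cdf(k / (m + 1)) for k in range(1, m + 1)]
    return ecc_reorder(raw_members, quantiles)
```

Applying the same reordering to every variable transfers the rank dependence
of the raw ensemble to the calibrated samples.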


The key focus of the present paper is, however, evaluating how the inclusion
of ECC affects post-processing strategies. Although, as discussed above, the
performance of the various ECC variants differs, this appears to have no effect
on the forecast performance. The only instance in which the ECC variations
noticeably diverge is when the neural network is used in bidding zone 4 in
Sweden with the One-Step-W strategy. In this case, the One-Step-P method
performs far better than any One-Step-W variation, and the EMOS variant of
One-Step-W performs better than all ECC strategies. In this case, ECC did not
lead to any improvement, but actually caused the forecasts to be slightly worse.
For other post-processing strategies and forecast horizons, there is no
noticeable difference between the One-Step-W or Two-Step-WP variations with and
without ECC. For the forecast horizons and data sets considered, we therefore
find that dependency structures do not play a significant role with regard to
wind power forecast accuracy. 


We consider two possible explanations for this. Firstly, the dependency struc- 
tures present in both our data sets are not strong. Although Schefzik et al. [21] 
reported improved results when using ECC, they were working with data sets 
that showed clear dependencies. Since there are no strong dependency
structures for ECC to restore in our data sets, it makes sense that this does
not lead to an improvement. We therefore suggest investigating the dependency
structures before deciding whether ECC is included in the post-processing strategy. 
The second explanation is that the wind power forecast model is capable of 
implicitly restoring these dependencies. When the weather ensembles are used 
to generate wind power forecasts an implicit calibration (due to the training 
of the model with historical weather variables and observed/simulated wind 
power generation) occurs and this could be sufficient to negate the effect of the 
missing dependency structures. 


6 Conclusion 


The present paper investigates whether including Ensemble Copula Coupling
(ECC) in different post-processing strategies affects which of the strategies
is optimal. We show that ECC, particularly ECC-Q and ECC-T, restores the
dependency structures effectively, but this does not affect the forecast
performance. The strategies post-processing the weather variables only
(One-Step-W) or both the weather variables and the wind power (Two-Step-WP)
deliver almost identical results, regardless of whether ECC is used or not.
Given these results, we conclude that ECC does not change the optimal
post-processing strategy for wind power forecasts. Due to the smaller number of 
post-processing steps required and superior or similar forecast accuracy, the 
strategy post-processing only the resulting wind power (One-Step-P) remains 
optimal. However, our data does not contain strong dependency structures 
which limits the potential of ECC. Therefore, we recommend investigating the 
dependency structures before selecting a post-processing strategy. 


Future work should focus on investigating data with different dependency struc- 
tures to understand when ECC plays a role. Additionally, while the present 
paper focuses solely on dependencies between weather variables, future work 
should focus on recreating the spatial and temporal dependencies with ECC 
and analysing their effect on forecast performance. 


Abstract 


SARS-CoV-2 is a highly contagious virus that can induce pulmonary compli- 
cations like viral pneumonia and acute respiratory distress syndrome (ARDS). 
In order to support RT-PCR testing, chest X-rays are used to identify the pre- 
sence of COVID-19 in lungs. Radiologists proficient in chest X-Ray (CXR) 
interpretation are scarce, motivating our work of examining the performance 
of convolutional neural networks (CNNs) on this task. CNNs are the state-of- 
the-art image classification method. In this work we classify X-rays into four 
classes (COVID-19, other lung opacity, other diseases and normal). We find
that MobileNetV1 is the best CNN for this task, achieving an overall accuracy
of 70% and a COVID-19 accuracy of 83% on a test data set. By increasing the 
number of images of the unrestricted classes normal, lung opacity and other, 
the COVID-19 accuracy can be increased to 95% and the overall accuracy 
to 78%. We leverage model interpretability techniques and provide attention 
heatmaps that can assist in validating the model’s decision process. 


1 Introduction 


The novel coronavirus disease 2019 (COVID-19) has resulted in an ongoing
outbreak of viral pneumonia all over the world. By now, more than 12.9 million
people have been infected, of whom more than 570 thousand have died [1].
The containment of the disease is based on the identification of infected
people, backtracing of infections, and isolation of contagious people. Because
COVID-19 is easily confused with influenza, specific tests are necessary.
There are different methods to test for the presence of SARS-CoV-2, such as
an RNA-based assay using nasopharyngeal swabs (RT-PCR testing).
Unfortunately, there have already been shortages of swabs, and testing
capacities reached their limits in various regions of the world [2]. Therefore,
supporting methods for COVID-19 identification are needed. One alternative
way of testing for COVID-19 is thorax X-ray imaging, which can give immediate
diagnostic information. Additionally, the stage of the disease and the risk
of a severe course can be seen [3]. While thoracic X-rays can be taken
rapidly, the scarce availability of radiologists to analyze them is a key
bottleneck toward accelerating differential diagnosis. Therefore, the study of 
automatic image classification approaches for the identification of COVID- 
19 is an emerging research topic. This work focuses on convolutional neural 
networks (CNNs). Our contributions can be summarized as follows: Using 
transfer learning, different CNN base models for image classification tasks 
such as Xception, ResNet50, InceptionV3, MobileNetV1, and MobileNetV2 are
compared with well-known classification performance metrics. The analysis
finds that MobileNetV1 is the best CNN for this task, achieving an overall
accuracy of 70% and a COVID-19 accuracy of 83% on a test data set. These 
results are based on a relatively small training data set (each class n=1000, 
except COVID-19 n=538). By increasing the number of images of the unre- 
stricted classes normal, lung opacity and other, the COVID-19 accuracy can be 
increased to 95% and the overall accuracy to 78%. This work also indicates that 
there may be unique features in thorax X-Ray images of COVID-19 infected 
people which differ from other forms of pneumonia. This raises new questions 
for further research such as why some cases of COVID-19 are relatively easy 
to identify and what are the key characteristics on which a CNN determines 
the presence of COVID-19. Future work on these research questions might be 
helpful to obtain more understanding of COVID-19 which could contribute to 
the development of further treatments. 


This work is structured as follows: We give an explanation of convolutional 
neural networks and transfer learning in Chapter 2. In Chapter 3 we survey 
relevant work in the field of medical image recognition with focus on the 
detection of COVID-19. Chapter 4 presents an overview of the data which 
we used for our experiments. Chapter 5 describes the structure and
results of our experiments. We discuss our results and findings in Chapter 6. 
Finally, Chapter 7 summarizes the outcome of this work, and future research is 
proposed. 


2 Related Work 


2.1 Convolutional Neural Networks 


Convolutional neural networks (CNNs) are a state-of-the-art method for image
classification. The advantage of CNNs over standard (deep) feed-forward
networks is that they can work with two-dimensional input data. This allows
CNNs to capture spatial dependencies of an image, reduce the number of
parameters, and improve the reusability of weights [4]. CNNs receive matrices
as input. Different layer types process the information. The most important
ones are the convolutional layer, the fully connected layer, and the pooling layer. 


The convolutional layer is the primary building block of CNNs. In the
convolutional step, a weight matrix (filter) is slid over the input image, and
at each position the dot product of the filter and the underlying image patch
is computed. The result is a feature map representing one feature of the input
image. Subsequent pooling steps reduce the size of feature maps while keeping
the essential information. Pooling is a filter that strides over the given
feature map and returns only a single value (e.g., the maximum) per window. 
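

The two operations can be illustrated with a small, dependency-free sketch.
The 5 x 5 image and the 2 x 2 edge filter below are made-up toy values; as in
most deep learning frameworks, the "convolution" is implemented as a sliding
dot product without flipping the filter.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size):
    """Non-overlapping max pooling: one value per size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image with a bright vertical stripe in columns 2 and 3.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, vertical_edge)  # 4 x 4, rows are [0, 2, 0, -2]
pooled = max_pool(feature_map, 2)                 # 2 x 2, [[2, 0], [2, 0]]
```

The feature map is largest where the stripe begins, and pooling keeps this
response while halving each spatial dimension.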


The last layers of a CNN are usually fully connected. Within these layers, every
neuron is connected to every neuron of the preceding layer. Because fully
connected layers can only process vectors, the last feature maps are flattened.
The structure of a typical CNN is shown in Figure 2. 


2.2 Transfer Learning 


Many medical image classification tasks suffer from a low number of cases.
Transfer learning can be used in such situations. Transfer learning describes the
process of not training CNNs from scratch but employing pre-trained models
instead. The training time is reduced dramatically because the weights of most
layers are frozen and only the last layers are re-trained. The results are
typically very good, since most image features (e.g., edge detection) can be
reused on related problems without modifications [5, 6]. 


Figure 1: Comparison of COVID-19 glass-opacity in CT scan (left) and X-ray (right). Left image 
from Caruso et al. [16], right image from the actual data set. 


3 CNN for Medical Image Recognition 


The past few years have witnessed a rise of deep learning methods in medical 
image analysis [7]. CNNs now match or surpass the performance of humans 
in disease detection across a rich set of tasks [8] such as mammography de- 
tection [9], diagnosing pulmonary conditions from X-Ray [10], and computed 
tomography (CT) data [11]. 

COVID-19 induces patterns of viral pneumonia that can be identified through
imaging techniques such as chest X-rays, CT, and lung ultrasound. The most
prevalent patterns are ground-glass opacity (GGO) and patchy consolidations
in CT and X-ray, and B-lines and pleural line abnormalities in ultrasound (for
a detailed review see [12]). Throughout 2020, several hundred publications
attempting to automatically detect COVID-19 from imaging data appeared [13]
(overviews of individual methods can be found in [14, 15]). In the following,
results of all three methods are briefly summarized. 


3.1 COVID-19 Detection in X-rays 


Two publications stand out in current research. One
was written by Narin et al. [18], who achieved the overall best score of 97%
accuracy (COVID-19 sensitivity = 96%) with a pretrained ResNet50 model
on a binary classification task (COVID-19 / healthy). As a data set, they used
50 X-rays of healthy and 50 of COVID-19 infected people [18]. The other 


Figure 2: A typical convolutional network (Source: Albelwi and Mahmood [17], p. 5) 


is closely related to this work. Apostolopoulos and Mpesiana [19] conducted a
comparison of multiple CNNs for classification into three classes (common
pneumonia, COVID-19, and normal incidents). They compared pretrained
VGG19, MobileNetV1, Inception, Xception and Inception ResNetV2 on a data
set containing 224 COVID-19, 700 common pneumonia and 500 healthy
X-rays. They found VGG19 to have the best overall accuracy of 92.85%
(COVID-19 sensitivity = 92.85%) and MobileNetV1 the best COVID-19 sensitivity
of 99.10% (accuracy = 92.85%) [19]. 


3.2 COVID-19 Detection in CT Scans 


With CT scans it is possible to create cross-sectional images of the human body
and thereby obtain a more holistic view of a specific body region. To achieve
this, multiple X-ray measurements are taken from different angles, and the
image is computed from the detector signals. Due to this procedure, more images
can be generated from one patient. On the downside, the radiation exposure is
much higher. In recent publications, the biggest data set was generated by Chen
et al. [20], who used 46,096 CT images from just 106 patients (51 COVID-19
positive). With these data, a U-Net++ model was trained to determine the
presence of COVID-19. The model achieved an overall accuracy of 95.24%
(COVID-19 sensitivity = 100%) on the test set. This model was also tested
in combination with radiologists and reduced their reading time of a CT
scan by 65% [20]. Another publication worth mentioning is 
written by Butt et al. [21]. The authors achieved their best accuracy of 86.7%
(COVID-19 sensitivity = 86.7%) on a three-class classification task (COVID-19,
Influenza-A, irrelevant-to-infection). This accuracy was achieved by using
a ResNet23-based model on a data set with 618 images (CT samples from 110
patients with COVID-19, 224 with Influenza-A and 175 healthy people) [21].
These studies show that, for binary and ternary classification, an accuracy and
sensitivity of around 90% are possible on both image types. 


3.3 COVID-19 Detection in Lung Ultrasound (LUS) 


LUS has been repeatedly recommended by clinical authorities [22] but has 
been neglected by the ML community [13], partially due to data heterogeneity 
resulting from higher operator dependency. However, public databases with 
LUS recordings of different pathologies are emerging [23], and the authors
achieve a specificity of 0.91 and a sensitivity of 0.98 on COVID-19 detection
in a 3-class classification that also incorporates bacterial pneumonia and
healthy controls [24]. Others focused on severity assessment of COVID-19
infection and achieved a recall of 0.6 and a positive predictive value of 0.7 [25]. 


4 Preparation of data 


The different classification models need a common data set to make the results
comparable. Therefore, a training set and a test set are built. Both sets consist
of front-view chest X-ray images divided into four classes: COVID-19,
other lung opacities, other diseases, and healthy lungs (Figure 3). All but the
COVID-19 images are randomly drawn from the Kaggle RSNA Pneumonia
Detection Challenge (Kaggle 2018). The COVID-19 data set is based on the
collection built by Kalkreuth and Kaufmann [26], which combines multiple
sources (162 images). This data set is expanded with images provided by
radiologists of the Johannes Gutenberg-University Mainz (73 images), by the
Charité - Universitätsmedizin Berlin (9 images) and from the Cohen database
[27] (389 images). The structure of the training set (85% of the whole data
set) and the test set (15% of the whole data set) is shown in Table 1. As can
be seen, the number of COVID-19 images is 


(a) COVID-19 (b) Healthy Lung 
(c) Lung Opacity (d) Other Lung Disease 


Figure 3: Example images from the data set 


much smaller than the number of images in the other classes. The CNNs are
trained on the given set and compared to identify the best one, which is then
optimized. On the one hand, this produces an overview of the performance
of different CNNs; on the other hand, it shows which model is the most
promising for further fine-tuning. 
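

The 85%/15% split can be reproduced per class, as the following minimal sketch
shows. The class totals are the sums of the two columns of Table 1 (538 + 95
and 1000 + 176); the function name, the fixed seed, and the use of index lists
instead of image files are illustrative assumptions and not the procedure
actually used here.

```python
import random


def stratified_split(labels, train_fraction=0.85, seed=0):
    """Return (train_indices, test_indices), splitting every class with the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Class totals taken from Table 1: 633 COVID-19 images, 1176 per other class
# (only one of the three other classes is shown to keep the example short).
labels = ["covid"] * 633 + ["opacity"] * 1176
train, test = stratified_split(labels)
covid_train = sum(labels[i] == "covid" for i in train)     # 538
opacity_train = sum(labels[i] == "opacity" for i in train)  # 1000
```

Splitting each class separately keeps the class proportions identical in both
sets, which matters here because the COVID-19 class is much smaller.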


5 Experiments 


5.1 Overview of Experiments 


Figure 4 gives an overview of all experiments carried out. The first experiment 
was conducted to identify suitable hyperparameters. MobileNetV2 was chosen 
as the reference model due to its relatively small size. This introduces bias to 
the experiments because a good learning rate might differ between different 


Table 1: Structure of the training and test dataset 


Class                 Training Set   Validation Set
COVID-19              538            95
Other lung opacity    1000           176
Other lung disease    1000           176
Healthy lung          1000           176


[Figure 4 flowchart: Experiment 1 (unsystematic hyperparameter search on
MobileNetV2) -> Experiment 2 (comparison of all models with the same
hyperparameters) -> Experiment 3 (optimization of the best model)]


Figure 4: Process of experiments 


models. In the second experiment, the different models are compared using the
hyperparameters found in the first experiment. The last experiment revolves
around the optimization of the best model from the second experiment. In
the following, each section describes one experiment in detail as well as its
results. 


5.2 Identification of Hyperparameters (Experiment 1) 


The first experiment was used to identify hyperparameters that work well
for the initial comparison. For this experiment, the pretrained MobileNetV2 was
used as the base model and extended by some custom layers, as shown in
Figure 5. 


Each set of hyperparameter settings (as well as the later experiments) should
have been evaluated with at least five-fold cross-validation. Unfortunately, at
the time of this work there was not enough computational power available.
Therefore, the experiments should be re-evaluated with cross-validation when
possible. The results were compared by the best validation error during 


[Figure 5 diagram: Pretrained MobileNetV2 -> Max Pooling (4,4) -> Flatten ->
Dense (1024) -> Dropout (0.3) -> Dense (4)]


Figure 5: Additional layers on top of the pretrained model 


training. The following hyperparameters were examined: the Adam learning
rate, number of epochs, data augmentation, dropout rate and unfreezing of
layers. The outcome indicates that a learning rate of 0.001, 60 epochs with
a completely frozen pretrained model and nearly no data augmentation are
good starting values to compare the different CNNs. Even though only little
optimization has been conducted, MobileNetV2 already reaches an accuracy of
64.8% on the validation set. 
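

To give a feeling for how small the trainable part is when the pretrained base
is completely frozen, the sketch below counts the parameters of the custom
layers from Figure 5. It assumes 224 x 224 input images (for which MobileNetV2
without its classifier produces a 7 x 7 x 1280 feature map) and pooling without
padding and with a stride equal to the pool size; the input size and the
pooling settings are not stated in the text and are assumptions.

```python
def max_pool_output(size, pool):
    """Spatial size after pooling without padding, stride equal to the pool size."""
    return (size - pool) // pool + 1


def dense_parameters(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


# ASSUMPTION: 224 x 224 inputs, giving a 7 x 7 x 1280 MobileNetV2 feature map.
height, width, channels = 7, 7, 1280

pooled_h = max_pool_output(height, 4)  # 1
pooled_w = max_pool_output(width, 4)   # 1
flattened = pooled_h * pooled_w * channels

hidden = dense_parameters(flattened, 1024)  # Dense (1024)
output = dense_parameters(1024, 4)          # Dense (4), one unit per class
trainable = hidden + output                 # pooling, flatten, dropout add none
```

Under these assumptions, about 1.3 million parameters are trained, and nearly
all of them belong to the first dense layer.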


5.3 Benchmark of CNNs for COVID-19 identification 
(Experiment 2) 


As a next step, the five CNNs (Xception, ResNet50, InceptionV3, MobileNetV1,
MobileNetV2) are compared using the hyperparameters listed in Table 2. 


Just as in the previous experiment, the validation accuracy was used to compare
the performance of the different base models. Only in Experiment 3 do we
look at the actual performance on the test set. All experiments were
conducted with five-fold cross-validation, and the mean is plotted. Figures 6
to 10 show the training and validation accuracy for each model, with
Figure 11 as a summary. The three best performing models regarding maximum
validation accuracy over all epochs are MobileNetV1 (68.46%), InceptionV3 


Table 2: Hyperparameters for comparison 


Hyperparameter       Value
Optimizer            Adam
Learning rate        0.001
Dropout rate         0.3
Epochs               60
Pretrained           ImageNet
Freeze               Completely frozen
Data augmentation    Rotation range = 10


(66.17%), and Xception (60.54%). Although it did not reach the highest maximum
validation accuracy, we decided to optimize MobileNetV2 because of its low
training accuracy: it seems that increasing the number of epochs would also
increase the overall accuracy, which makes the model much easier to optimize.
A further advantage is the small size of the network, which might allow it to
run on mobile X-ray devices. Therefore, MobileNetV2 is selected as the most
promising CNN, and its hyperparameters are optimized systematically in
Experiment 3. 
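

The five-fold cross-validation used for the comparison can be sketched as
follows. The `evaluate` callback is a hypothetical placeholder for "train a
model on the training indices and return its validation accuracy"; in practice,
the samples would be shuffled (and ideally stratified by class) before the
folds are formed.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def cross_validated_mean(n_samples, evaluate, k=5):
    """Average a validation score over k folds.

    `evaluate(train, validation)` is a hypothetical callback that trains on the
    first index list and returns a score measured on the second.
    """
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Every sample is used for validation exactly once, so the mean over the folds
is less dependent on one particular split than a single validation set.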


5.4 Optimization of MobileNetV2 (Experiment 3) 


In the last round of experiments, we try to optimize MobileNetV2 as far as
possible regarding the overall accuracy and then look at the performance
on the test set. As in the previous experiment, five-fold cross-validation was
used. All hyperparameters were examined one after the other; consequently,
it is possible that other combinations might increase the accuracy even
further, but this remains a subject for future research. First, the
learning rate of the Adam optimizer was optimized, followed by the dropout rate
and the number of frozen layers. The results are summarized in the following
tables, with the best parameters in bold (Table 3 - Table 5): 


These hyperparameters were used to train the final model and test it on the
previously unseen test set. Because it performed best in the second experiment,
we also test MobileNetV1 on the test set. For testing, the overall best 


Figure 6: Performance result of MobileNetV1 


[Plot: MobileNetV2, training accuracy (Accuracy) and validation accuracy
(Val_Accuracy) over epochs 0 to 60]


Figure 7: Performance result of MobileNetV2 


Figure 8: Performance result of InceptionV3 


Figure 9: Performance result of ResNet50 


Figure 10: Performance result of Xception 


Figure 11: Overall performance of models 


Table 3: Results from the optimization of the learning rate 


Learning Rate   Val. Accuracy
0.001           0.5848
0.0025          0.3946
0.005           0.3206
0.0075          0.2784
0.00075         0.606
0.0005          0.582


Table 4: Results from the optimization of the dropout rate 


Dropout Rate Val. Accuracy 


0.25 0.5467 
0.2 0.5416 


epoch was used. Figure 12 and Figure 13 show that, even though we optimized
MobileNetV2, it performs slightly worse than MobileNetV1. Both CNNs
were able to distinguish COVID-19 and normal lungs from the other classes,
with MobileNetV1 peaking at a COVID-19 accuracy of 83% (MobileNetV2: 78%).
Both networks had major problems differentiating lung opacities from other
lung problems, which leads to a lower average accuracy of 70% for MobileNetV1
and 64% for MobileNetV2. 
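

The difference between a per-class accuracy and the overall accuracy can be
read directly from a confusion matrix, as in the following sketch. It assumes
that the per-class accuracy denotes the share of images of a class that are
classified correctly; the 3 x 3 matrix contains made-up counts for illustration
only and does not reproduce any result of this work.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# MADE-UP counts; rows are true classes, columns are predicted classes.
confusion = [
    [8, 1, 1],  # class 0 is recognized well
    [2, 5, 3],  # classes 1 and 2 are frequently confused with each other
    [1, 4, 5],
]

class_scores = per_class_accuracy(confusion)  # [0.8, 0.5, 0.5]
total_score = overall_accuracy(confusion)     # 0.6
```

One well-separated class can thus have a high accuracy while confusions between
the remaining classes pull the overall accuracy down.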


On the upside, the problematic classes are not restricted in terms of images;
therefore, the same models were trained again on a bigger training data set. For
each class except COVID-19, 2000 images instead of 1000 images were used.
After training, the new models were tested on the same test set. The results are
summarized in Figure 14 and Figure 15. 


The increase in training images for all classes except COVID-19 leads to an
increase in all KPIs and raises the accuracy for COVID-19 detection up to 95%
(MobileNetV2: 91%). 


Figure 12: Classification result of MobileNetV1 using 1000 images in each Non-COVID-19 class 


Figure 13: Classification result of MobileNetV2 using 1000 images in each Non-COVID-19 class 


126 Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020 


Figure 14: Classification result of MobileNetV1 using 2000 images in each Non-COVID-19 class 


Figure 15: Classification result of MobileNetV2 using 2000 images in each Non-COVID-19 class 


Table 5: Results from the optimization of the unfreeze level 


Unfreeze Val. Accuracy 


1 0.4087 
2 0.4008 
3 0.3847 


5.5 Interpretability analysis 


Model interpretability is paramount in healthcare applications because, in
standard medical treatment, the physician is required to provide an explanation
accompanying the diagnosis. As an instance of post-hoc, gradient-based
interpretability methods, GradCAM [28] is a technique that uses the gradients
of the class score with respect to the feature maps of any target
layer (usually the last convolutional layer) to compute a localization heatmap
highlighting the most informative parts of an input image given a class label. 
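

The core computation of GradCAM can be sketched without a deep learning
framework: each feature map of the target layer is weighted by the spatial
average of its gradient, the weighted maps are summed, and negative values are
clipped. The sketch below assumes that the activations and the gradients of the
class score have already been extracted from the network; the function name and
the nested-list data layout are illustrative choices, not the code used for
Figure 16.

```python
def grad_cam(activations, gradients):
    """GradCAM heatmap from one layer's feature maps and class-score gradients.

    Both arguments are lists of K feature maps, each an H x W list of lists.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight: gradient averaged over all spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    heatmap = []
    for r in range(height):
        heatmap.append([
            # Weighted sum over channels, followed by ReLU.
            max(0.0, sum(w * act[r][c] for w, act in zip(weights, activations)))
            for c in range(width)
        ])
    return heatmap
```

For display, the heatmap is usually upsampled to the input resolution and
overlaid on the image.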


Following the training of the MobileNetV2 model, we computed CAMs of
all images in the test dataset and provide a few exemplary maps in Figure 16.
While it can be seen that the model successfully learned to highlight pulmonary
markers like GGOs in, e.g., the COVID-19 example, it is evident from the
negative examples that the model is prone to artifacts and occasionally focuses
on irrelevant regions such as borders. 


6 Discussion 


The previous experiments show that the chosen CNNs can identify COVID-19
infected lungs rather well on the test set, despite the small training set. This
chapter points out different possible reasons for the observed
behaviour. First, COVID-19 may have features that are unique compared to other
infections and can thereby be identified more easily than other classes. This
hypothesis is supported by the fact that normal lungs also have quite a good
accuracy compared to the classes lung opacity and other, where the images are
much harder to distinguish. Another reason for the good identification of
COVID-19 might be 


[Figure 16: panel A (correct CAMs), panel B (incorrect CAMs); sample labels
include COVID-19, lung opacity, and healthy]


Figure 16: Visualizations of class activation maps with GradCAM [28]. Subfigure A shows four 
samples from the test dataset with CAMs focused on the chest. For example, the 
heatmap of the COVID-19 sample highlights GGOs. Subfigure B depicts two cases where 
the CAMs are not aligned with pathologically relevant areas. 


the underlying data. Due to the novelty of the disease, COVID-19 images are
rare, and the available data are more homogeneous than the data in the other
classes. This can lead to a bias in terms of accuracy, where the CNN predicts
COVID-19 whenever an image differs from the homogeneous data of all other
classes combined. In that case, it might not learn the features of COVID-19 but
other differences, such as the quality of the image to classify. The presence
of this problem, at least to some degree, is supported by the last experiment
with increased samples for all classes but COVID-19. Even though the number of
COVID-19 samples stayed the same, its accuracy increased by 12 percentage
points (MobileNetV1). Since this finding has been a major problem for our work,
we started to analyze the predictions with the help of GradCAM and found that
our model can, to some extent, learn to focus on the relevant regions in the
image. Although we also found a significant number of samples with a mislocated
attentional focus, it is encouraging to see the model extracting spatial
biomarkers in a self-supervised way and without any segmentation masks.
However, in a next step, lung segmentation with a U- 


Figure 17: Lung segmentation 


Net segmentation model [29] will be used to exclude identification by features
outside the lung. An example is shown in Figure 17. 


7 Conclusion and future work 


In this work, different CNNs were compared and the most promising one
(MobileNetV2) was optimized. Unexpectedly, MobileNetV1 still performed better
and peaked at an overall accuracy of 78% and a COVID-19 accuracy of 95% in
detecting COVID-19 infected lungs. These scores are relatively
high, especially for COVID-19. This raises the possibility that
the underlying data, which were mainly generated from two sources, introduced
some bias. This work raises multiple questions, which can be examined in further
research. One topic might be why the cases of COVID-19 are relatively easy to
identify and, in combination with that, what the key characteristics are on
which a CNN determines the presence of COVID-19. This research could help to
further understand the virus and ultimately might help to find a cure. Another
topic might be to compare the performance of CNNs with support vector machines
or other image classification methods. 


1 Introduction 


In many applications, from architecture to robotics, reinforcement learning
approaches for stabilizing, controlling, and optimizing systems have been
considered as an alternative to classic control methods. Advantageously,
reinforcement learning is a quite flexible framework for numerous control
problems and allows including optimality conditions and, if available, prior
knowledge. 


Treating the control problem in the formalism of Markov decision processes
(MDPs), strong convergence results on optimality are available for
reinforcement learning algorithms (see e.g. [11, 4]), which makes these
algorithms attractive. 


Even though significant progress has been achieved in the field of reinforcement
learning over the past decade (see e.g. [7, 9]), solving real-world control
problems for nonlinear systems using reinforcement learning is still a very
challenging problem. It is well known that reinforcement learning
approaches are subject to the curse of dimensionality, see e.g. [8].
Thus, for practical solutions of possibly nonlinear systems, it is required to
find reasonable approximations instead of the exact solution of the underlying
Hamilton-Jacobi-Bellman equation. 


In this contribution, we apply and compare reinforcement learning approaches
for the swing-up of the nonlinear cart-pole problem considering disturbances.
To this end, we use a classic agent-environment scheme (see e.g. [7]). The
system, and in particular the swing-up process, is available as a nonlinear ODE
model derived from the Lagrange formalism, which allows for efficient simulation.
In particular, we compare dynamic programming and temporal difference learning
for the swing-up of the cart-pole system. We evaluate the influence of the
most important metaparameters (e.g. learning rate), as well as the granularity
of the discretization, on the learning process. 
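

The role of the discretization granularity can be illustrated with a small
sketch that maps a continuous state to the index of a table entry. The uniform
grid, the clipping at the borders, and the function names are illustrative
assumptions; the discretization actually used in this contribution may differ.

```python
def discretize(value, low, high, bins):
    """Map a continuous value to a bin index in 0..bins-1, clipping at the borders."""
    if value <= low:
        return 0
    if value >= high:
        return bins - 1
    return int((value - low) / (high - low) * bins)


def state_index(state, bounds, bins):
    """Flatten the per-dimension bin indices into one table index."""
    index = 0
    for value, (low, high) in zip(state, bounds):
        index = index * bins + discretize(value, low, high, bins)
    return index


def table_size(dimensions, bins):
    """Number of discrete states for a uniform grid."""
    return bins ** dimensions


# A four-dimensional state: doubling the resolution multiplies the table by 16.
coarse = table_size(4, 10)  # 10000
fine = table_size(4, 20)    # 160000
```

The exponential growth of the table with the number of state dimensions is the
curse of dimensionality mentioned above and motivates a careful choice of the
granularity.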


We show the applicability of reinforcement learning to the studied problem and
discuss practical issues of the considered approaches. In addition, we propose a
strategy based on reward scheduling for solving advanced control problems,
and discuss adaptive discretization as a tool to trade off computational effort
and accuracy. Although we show that a simulation model speeds up the
learning process significantly, the approaches studied here can be applied to a
hardware-only setup. 


The paper is structured as follows: Section 2 describes the studied system and
the swing-up process, and Section 3 motivates the reinforcement learning
approach. Section 4 presents the results. The paper concludes with a summary
and discussion in Section 5.


2 Studied system: Swing-up of the Cart-Pole 


2.1 Model 


As a model, we consider the classic cart-pole inverted pendulum, derived from
Lagrangian mechanics. A scheme of the system is depicted in Figure 1; the
system is described by the ODE system:


$$ \dot{x}_1 = x_2 \tag{1} $$

$$ \dot{x}_2 = \frac{-m_2 l \sin(x_3)\, x_4^2 + u + m_2 g \cos(x_3) \sin(x_3)}{m_1 + m_2 - m_2 \cos(x_3)^2} \tag{2} $$

$$ \dot{x}_3 = x_4 \tag{3} $$

$$ \dot{x}_4 = \frac{-m_2 l \cos(x_3) \sin(x_3)\, x_4^2 + u \cos(x_3) + m_2 g \sin(x_3) + m_1 g \sin(x_3)}{l \left(m_1 + m_2 - m_2 \cos(x_3)^2\right)} \tag{4} $$


Figure 1: Scheme of the considered cart-pole system.


Here, $x_1$ denotes the cart's position, $x_2$ the cart's velocity, $x_3$ the
angle $\theta$, and $x_4$ the angular velocity $\dot{\theta}$. Finally, $u$
denotes the system's input in terms of a force applied to the cart in positive
x-direction. For simplicity, the input is chosen from the binary input set
$u \in U := \{-1, 1\}$.


2.2 System Simulation 


In order to apply the reinforcement learning algorithm successfully, we need to
simulate the system repeatedly. More precisely, we need to compute the future
system states (fixed time horizon h = 0.2 s) for varying initial conditions
and inputs. This is achieved here using the Python-based OpenAI gym engine
[3], considering the inverted cart-pole as described in [1], see Figure 1. The
OpenAI gym provides an efficient numerical solution as well as a visualisation
of the system dynamics.
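
To make the simulation step concrete, the following standalone sketch integrates Eqs. (1)-(4) over the horizon h = 0.2 s for a constant input. It is not the OpenAI gym implementation used in the paper: the parameter values `M1`, `M2`, `L_POLE`, the fixed-step Runge-Kutta scheme, and all function names are assumptions made for illustration only.

```python
import math

# Assumed physical parameters (hypothetical; the paper does not list values).
M1, M2, L_POLE, G = 1.0, 0.1, 0.5, 9.81


def cartpole_rhs(x, u):
    """Right-hand side of Eqs. (1)-(4) for the state x = (x1, x2, x3, x4)."""
    _, x2, x3, x4 = x
    s, c = math.sin(x3), math.cos(x3)
    denom = M1 + M2 - M2 * c * c
    dx2 = (-M2 * L_POLE * s * x4 ** 2 + u + M2 * G * c * s) / denom
    dx4 = (-M2 * L_POLE * c * s * x4 ** 2 + u * c
           + M2 * G * s + M1 * G * s) / (L_POLE * denom)
    return (x2, dx2, x4, dx4)


def rk4_step(x, u, dt):
    """One classical fourth-order Runge-Kutta step with constant input u."""
    def shifted(base, slope, factor):
        return tuple(b + factor * s for b, s in zip(base, slope))

    k1 = cartpole_rhs(x, u)
    k2 = cartpole_rhs(shifted(x, k1, dt / 2), u)
    k3 = cartpole_rhs(shifted(x, k2, dt / 2), u)
    k4 = cartpole_rhs(shifted(x, k3, dt), u)
    return tuple(xi + dt / 6 * (a + 2 * b + 2 * c + d)
                 for xi, a, b, c, d in zip(x, k1, k2, k3, k4))


def simulate(x0, u, horizon=0.2, substeps=20):
    """Propagate the state over one decision interval of length `horizon`."""
    x = tuple(x0)
    dt = horizon / substeps
    for _ in range(substeps):
        x = rk4_step(x, u, dt)
    return x
```

Calling `simulate(state, u)` once per agent decision yields the successor state for one of the two admissible inputs u = -1 or u = +1.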


2.3 Goal/Reward 


Classically, in terms of optimal control, for the described nonlinear cart-pole
system we aim to find an (optimal) input sequence that transfers the system
from its lower, stable equilibrium point
($s^* := \{x_1 = 0, x_2 = 0, x_3 = 0, x_4 = 0\}$) toward the upper equilibrium
point (i.e. $x_1 = \mathrm{const.}$, $x_2 = 0$, $x_3 = 180°$, $x_4 = 0$). The
horizontal position is constrained throughout the process by $-L < x_1 < L$.


For the reinforcement learning framework, it is crucial to define a meaningful
reward function, which is granted after each episode (try-out). Generally, the
reward depends on the achievements made in an episode (and thus on the
decisions made for the input), and is subsequently propagated backwards so as
to learn from the current episode.


Here, a successful run (swing-up) is simplified to the condition
175° < $x_3$ < 185°. It is straightforward to add further (and more rigorous)
conditions defining success; for simplicity of presentation, this is omitted
here. Only in case of success as defined above is the reward R = 0 granted;
otherwise R = -1.
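
The success condition and reward above can be written as a small function. This is a minimal sketch: the function names are hypothetical, the angle is assumed to be given in radians, and wrapping the angle to [0°, 360°) is an added assumption that the paper does not state.

```python
import math


def is_success(x3_rad, lo_deg=175.0, hi_deg=185.0):
    """Success test 175 deg < x3 < 185 deg (angle wrapped; an assumption)."""
    angle_deg = math.degrees(x3_rad) % 360.0
    return lo_deg < angle_deg < hi_deg


def terminal_reward(x3_rad):
    """Reward R = 0 on success, R = -1 otherwise."""
    return 0.0 if is_success(x3_rad) else -1.0
```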


3 Reinforcement Learning Framework 


The reinforcement learning approach considered here is treated as an episodic
agent-environment interaction scheme as described in [7]. As the key algorithm
we consider temporal difference learning, in particular TD(0) one-step
Q-learning, see e.g. [9]. Figure 2 depicts how the agent (i.e. the AI choosing
the current input) interacts with its environment (the cart-pole system).


3.1 initialization() 


At the beginning of each learning episode we initialize the algorithm by
defining the metaparameters (learning rate α, exploration rate ε, discount
factor λ) as well as the simulation constraints (maximum number of time steps,
number of episodes, discretization, ...). Note that we consider a finite
uniform state space discretization rather than the infinite continuous state
space. In particular, we consider 4800 discrete states.
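
A uniform discretization of this kind can be sketched as follows. The paper only states the total of 4800 discrete states; the split into bins per state variable (`BINS`) and the value ranges (`RANGES`) below are hypothetical choices made so that the product equals 4800.

```python
import math

# Hypothetical bin counts per state variable: 10 * 8 * 12 * 5 = 4800 states.
BINS = (10, 8, 12, 5)
# Hypothetical value ranges (lower, upper) for x1, x2, x3, x4.
RANGES = ((-2.4, 2.4), (-3.0, 3.0), (0.0, 2.0 * math.pi), (-8.0, 8.0))


def bin_index(value, lo, hi, n_bins):
    """Uniform bin index; values outside [lo, hi) are clipped to the edge bins."""
    frac = (value - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


def encode_state(x):
    """Map a continuous state to a single integer in range(4800)."""
    index = 0
    for value, (lo, hi), n_bins in zip(x, RANGES, BINS):
        index = index * n_bins + bin_index(value, lo, hi, n_bins)
    return index
```

The resulting integer can be used directly as a row index of the Q-table.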


3.2 run() 


With the function run() we start the interaction of the agent with the
environment. First, the environment is reset (reset()), and the initial state
$s^*$ together with an initial input chosen by the agent is passed to the
environment. A simulation step is performed (step()), where the time
discretization of (1)-(4) with fixed step size h = 0.2 s is used
(discretize()). The function encode() is optional and used in a modified
algorithm not further considered here. Next, the current state and the current
input are updated, another step is performed, and a successor state is reached.
The process is repeated until either success, a constraint violation, or the
maximum number of time steps is reached.


Figure 2: Reinforcement structure scheme.


Of particular interest is the final or terminal state $s_T$ at the end of an
episode, which determines the reward granted. Only in case of success is the
reward R = 0 granted; in all other cases (e.g. violation of constraints, time
is up, ...) the reward is negative (R = -1). Depending on the RL algorithm, the
reward is back-propagated at the end of each episode (new q(s,a)) so as to
update the Q-table, which requires storing the states visited within the
episode. The TD(0) algorithm considered here, however, updates the Q-table at
each step by bootstrapping.
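
The per-step bootstrapped update of one-step Q-learning can be sketched as below. This is the textbook update rule, not code from the paper; the dictionary-based Q-table and the names `make_q_table`, `q_update`, and `discount` are illustrative assumptions.

```python
from collections import defaultdict

ACTIONS = (-1, 1)  # binary input set U


def make_q_table():
    """Q-table as a mapping: state -> {action: value}, initialised to zero."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def q_update(q, s, a, r, s_next, alpha, discount, terminal):
    """One-step Q-learning: move q(s, a) toward the bootstrapped target."""
    if terminal:
        target = r
    else:
        target = r + discount * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]
```

Because the target uses the current estimate of the successor state, no list of visited states has to be stored for this update.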
Figure 3: Influence of the learning rate α. Exploration rate ε = 0.1 is kept constant.


4 Results


We successfully applied the TD(0) algorithm to the cart-pole. To evaluate
the influence of the learning rate, we ran the RL approach with different
learning rates and compared the results. Figure 3 shows the influence of the
learning rate α on the average reward, obtained by averaging the reward over
five independent runs.


4.1 Influence of the learning rate 


As a general conclusion, a high learning rate is advantageous only at the
beginning; in the long run, a lower learning rate leads to better performance.
A plausible cause is the stochastic nature of the process due to the state
space discretization: the same action taken from a given discrete state may
lead to different successor states. When the process is started with a high
learning rate, the Q-table is updated quickly but gets stuck in local optima.
With a smaller learning rate, the Q-value of an action in a given state changes
more slowly and thus averages over all possible transitions. This in turn
leads, though more slowly, to a more complete overall picture of the swing-up
process.


Figure 4: Influence of the exploration rate ε. Learning rate α = 0.2 is kept constant.


This motivates a modification of the RL algorithm in which the learning rate α
is realized as a linearly decaying function over time. In the remainder,
α decays from 0.5 to 0.01 over the first 25000 episodes.
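
The linear decay described above amounts to the following schedule. The function name and the choice to hold the final value constant after the decay phase are assumptions; the paper only gives the start value, end value, and number of episodes.

```python
def learning_rate(episode, start=0.5, end=0.01, decay_episodes=25000):
    """Linearly decay the learning rate, then hold it at `end` (assumed)."""
    if episode >= decay_episodes:
        return end
    return start + (end - start) * episode / decay_episodes
```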


4.2 Exploration vs exploitation 


It is well known that learning with RL strategies depends crucially on
balancing exploration of the state-action space against exploitation of the
results already obtained. Fig. 4 depicts the influence of the exploration rate
ε on the learning process. In general terms, high exploration rates are
theoretically advantageous, since each state-action pair is visited often and
convergence to the optimal policy is thus guaranteed; in practice, however,
high exploration rates are numerically prohibitive.


Figure 5: Visualisation of the Q-table. The cart's horizontal position and velocity are omitted. Dark
grey corresponds to the action u = -1, light grey to u = +1.


Learning does not depend on rigorous exploration of the state-action space
alone; it is boosted significantly once a first success is achieved.
Exploitation is therefore relevant in order to focus on the most promising
trajectories, leading to an early success that can be exploited thereafter. In
general, this relation is very difficult to translate into reasonable values of
ε. In our case, an exploration rate of 0.5 is well balanced.
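
The trade-off is commonly implemented with an ε-greedy rule, sketched below under the assumption that the Q-values of one state are stored as a dictionary from action to value. The function name and the `rng` argument are illustrative and not taken from the paper.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))    # explore
    return max(q_values, key=q_values.get)   # exploit
```

With ε = 0.5, roughly every second decision is exploratory, while the remaining decisions follow the current Q-table.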


4.3 Visualisation of the policy 


The resulting Q-table with the greedy action is shown in Fig. 5. The expected
symmetry properties are clearly visible. Each pixel corresponds to a discrete
state. The horizontal position and velocity of the cart are omitted for
simplicity.


5 Discussion 


The RL algorithm presented here is easily applied to a four-dimensional
nonlinear dynamic system. We have shown the applicability of the algorithm to
the swing-up process of the cart-pole. The approach can easily be adapted to
treat a variety of stabilisation goals, such as keeping the cart-pole at the
upper, unstable stationary point. The presented temporal difference algorithm
can be applied directly for this purpose; however, a functional approach (i.e.
learning the control parameters of a state feedback using reinforcement
learning) may be computationally more advantageous.


The presented approach is directly applicable to the hardware-only case.
However, in this naive setting with no prior knowledge, a rigorous exploration
of the state space is mandatory. A simulation model is therefore an advantage,
as it speeds up the training significantly. Prior knowledge can be encoded,
e.g., within the reward function or by adding further constraints.


Regarding computational tractability, the discretization of the state space
has to be considered carefully. While the available computational power limits
the overall number of discrete states, it is advantageous to distribute the
discrete states by taking the studied process into consideration. It is also
advantageous to constrain the state space to a meaningful observable area, so
as to subsume large angular and horizontal velocities into a few (forbidden)
states and thus reduce complexity.


Furthermore, it is possible to consider an adaptive discretization scheme,
starting with only a few discrete states and subsequently splitting those
states according to an appropriate measure based on the transition
probabilities obtained from the reinforcement learning algorithm. Further
research will apply the RL framework to different stabilization problems
occurring in production, e.g. adaptive picking of objects with a robot arm, or
placing objects in an optimal fashion.
Abstract


This short contribution presents a MATLAB toolbox for designing standard test
signals for linear and nonlinear system identification. When identifying
dynamic systems, the quality of the estimated models depends not only on the
parameter estimation methods and the chosen model structures, but also on the
properties of the data used for identification, which is why targeted methods
for exciting the target system are of interest. The contribution describes the
toolbox functions and demonstrates them with examples.


1 Introduction


Besides estimation methods and model structures, the properties of the data
used for identification also play a major role. An important distinction in
test signal design is between process-model-free and process-model-based
approaches. Process-model-free methods use signal characteristics and prior
knowledge to design test signals. They require little prior knowledge about the
target system or suitable model structures. They also serve as a preliminary
stage for model-based designs, which require structural model knowledge and,
where applicable, an initial parameterization. Homogenization methods play a
particularly important role here.


Process-model-based approaches use at least information about the model
structure and, in the nonlinear case, also the model parameters to be
estimated. The design methods are implemented for locally affine Takagi-Sugeno
models, but can in principle be used for other model classes. Choosing a model
class allows the methods to be specialized. Two types of model-based methods
are implemented in the toolbox: reduction of the parameter uncertainty based on
the Fisher information matrix (FIM), and output homogenization by means of
feedforward control. For all designs, a sampling time $T_a$ must be specified,
which follows from the application. Termination criteria for the individual
methods are described in the corresponding publications. The toolbox provides
discrete-time signals for the identification of single-input single-output
(SISO) systems. It will be made freely available on the homepage
(http://www.uni-kassel.de/fb15/mrt) of the Department of Measurement and
Control (Fachgebiet Mess- und Regelungstechnik, MRT).


The toolbox was implemented and tested in MATLAB version 9.5.0.944444
(R2018b). The particle swarm optimization is taken from the Global Optimization
Toolbox version 4.0 (R2018b). The toolbox is provided free for download.


2 Process-Model-Free Designs


For the process-model-free design, standard test signals are implemented:
multisine, multistep, and chirp signals. Noise signals were omitted, since
hardly any prior knowledge, such as the frequency grid or sensible step
lengths, can be incorporated into their parameterization. These typical
excitation signals were also compared in [1]. This part of the toolbox is based
on passing a cell array with the necessary boundary conditions to the test
signal generator. These conditions are presented in the following subsections.
The process-model-free design methods aim at adjusting the distribution of the
signal values and the compactness of the signals. Therefore, all signals
designed without a process model can be scaled to the admissible value range at
the end of the design.


Table 1: Settings and design options for the multisine signal

Setting               Meaning/Options
f_min, f_max          Frequency band
T_max                 Maximum experiment duration
Frequency selection   All, primes, every n-th
Phase selection       Random phases, Schroeder phases
Homogenization        hom/normal


2.1 Multisine Signals


For multisine signals, the parameters for adjusting the signal properties are
the amplitudes and phases of the frequency components contained in the signal,
which are selected beforehand. Before methods can be used to achieve specific
signal properties via these parameters, boundary conditions (Table 1) that
influence them must be specified. Except for the homogenization option, all of
them serve to create an initial signal. The frequency band should be selected
using prior knowledge. The sampling time is often dictated by the intended
application. The choice of the maximum experiment duration fixes the
fundamental frequency and thus the frequency grid. When selecting the active
frequencies, care should be taken that only a limited number of active
frequencies is present in a frequency band, since otherwise the signal energy
per frequency component is reduced. More detailed explanations can be found in
[2].
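
The construction of an initial multisine signal from these settings can be sketched as follows. This is not the toolbox code (which is written in MATLAB): the function names, the unit amplitudes, and the use of the standard Schroeder phase formula for a flat amplitude spectrum are assumptions made for illustration.

```python
import math
import random


def primes_up_to(n):
    """All primes p with 2 <= p <= n (simple trial division)."""
    return [p for p in range(2, n + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]


def multisine(f_min, f_max, t_max, t_s, selection="primes",
              phases="random", seed=0):
    """Sum of unit-amplitude cosines on the grid of multiples of 1 / t_max."""
    f0 = 1.0 / t_max  # the experiment duration fixes the fundamental frequency
    k_lo = math.ceil(f_min * t_max - 1e-9)
    k_hi = math.floor(f_max * t_max + 1e-9)
    harmonics = list(range(k_lo, k_hi + 1))
    if selection == "primes":
        prime_set = set(primes_up_to(k_hi))
        harmonics = [k for k in harmonics if k in prime_set]
    n_f = len(harmonics)
    if phases == "schroeder":
        phi = [-math.pi * i * (i - 1) / n_f for i in range(1, n_f + 1)]
    else:
        rng = random.Random(seed)
        phi = [rng.uniform(0.0, 2.0 * math.pi) for _ in harmonics]
    n_samples = int(round(t_max / t_s))
    u = [sum(math.cos(2.0 * math.pi * k * f0 * i * t_s + p)
             for k, p in zip(harmonics, phi))
         for i in range(n_samples)]
    return u, harmonics
```

For a frequency band of 0.1 to 2 Hz, a duration of 50 s, and a sampling time of 0.01 s, the grid spacing is 0.02 Hz and the prime selection keeps the harmonics 5, 7, 11, ..., 97.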
Table 2: Settings and design options for the multistep signal

Setting          Meaning/Options
N_step           Number of steps
l_min, l_max     Value range of the step lengths
Homogenization   hom/normal


The homogenization method adjusts the phases such that the value range of the
input signal is covered as uniformly as possible. For a detailed description of
the homogenization method, see [3].


2.2 Multistep Signals


For multistep signals, the direct design parameters are the heights and
lengths of the individual steps, once the boundary conditions from Table 2 have
been specified. The exact signal length is not a design parameter; it is set
indirectly via the number of steps and the step lengths. The step length should
be chosen such that the slowest time constant to be identified can be captured
by one step. Experience shows that, for a large number of steps, the maximum
experiment duration is approximately
$T_{max} \approx N_{step} \cdot \frac{1}{2} (l_{min} + l_{max})$. The absolute
step heights are not set explicitly at this point; they result from scaling the
test signal to the admissible signal value range, which can be carried out
afterwards. The design methods target the signal distribution itself. For the
fundamentals of generating multistep signals from pseudo-random binary
sequences, see [4]. For the non-homogenized version, uniformly distributed
random numbers are used for the step heights and lengths, since using, e.g.,
Sobol sequences showed no advantages. For the homogenized multistep signal, a
crest factor optimization is used. The crest factor of the signal u is defined
as:


$$ c_r = \frac{\max |u|}{\sqrt{\dfrac{u^\top u}{\mathrm{length}(u)}}} = \frac{\max |u|}{\sqrt{\mathrm{mse}(u)}} \tag{1} $$


Table 3: Settings and design options for the chirp signal

Setting          Meaning/Options
T_max            Maximum experiment duration
Homogenization   hom/normal


Since a binary signal results without further constraints, distribution
measures were integrated into the optimization problem via constraints.
For a more detailed description, see [3].
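
The crest factor can be computed in a few lines. The sketch below assumes the usual peak-to-RMS definition; the function name is hypothetical. It also illustrates why an unconstrained crest factor minimization ends in a binary signal: a signal that only takes the values +1 and -1 reaches the smallest possible value of 1.

```python
import math


def crest_factor(u):
    """Peak value divided by the root mean square of the signal."""
    rms = math.sqrt(sum(v * v for v in u) / len(u))
    return max(abs(v) for v in u) / rms
```

For comparison, one full period of a sampled sine wave has a crest factor of about 1.414, whereas a binary signal has exactly 1.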


2.3 Chirp Signals


Chirp signals are a quick way to design a signal that covers a frequency range
while exhibiting an approximately uniform distribution of the signal values.
Table 3 shows the settings available for this signal type. The properties of
such a signal are described in [3].


2.4 Example


As an example, the design of a multisine test signal is presented. First, a
cell array with the boundary conditions must be created:
info = {'msine', [0.1 2], 50, 0.01, 'primes', 'random', 'normal'}

This generates a discrete-time multisine signal that covers the frequency range
from 0.1 to 2 Hz, with a maximum experiment duration of T_max = 50 s at a
sampling time of T_a = 0.01 s. Only prime multiples of the fundamental
frequency are used. The phases are chosen randomly, and no subsequent
homogenization is performed. Calling the function for process-model-free test
signal design yields the signal shown in Figure 1. The figure shows the output
of the integrated display function, which reports, besides the crest factor, a
measure of the signal energy

$$ E = U^\top U^{*} \tag{2} $$

which is determined in the frequency domain using the discrete Fourier
transform U = DFT(u). Here, * denotes complex conjugation.


Figure 1: Toolbox display for the exemplary process-model-free multisine test signal design
(panel title: time domain, T_max = 50 s, c_r = 1.8142).
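
The energy measure can be reproduced with a direct DFT, as in the sketch below. The unnormalized DFT convention and the function names are assumptions; with this convention, Parseval's theorem gives E = N times the sum of the squared samples.

```python
import cmath


def dft(u):
    """Unnormalized discrete Fourier transform (direct O(N^2) evaluation)."""
    n = len(u)
    return [sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def signal_energy(u):
    """E = sum over k of U_k times its complex conjugate."""
    return sum((U_k * U_k.conjugate()).real for U_k in dft(u))
```

For the short signal u = [1, 2, 3, 4], the sum of squares is 30 and N = 4, so the energy measure evaluates to 120.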


3 Process-Model-Based Designs


Process-model-based designs require complete initial models, consisting of
structural information and model parameters. Both the FIM-based design in
Section 3.1 and the output homogenization in Section 3.2 are implemented for
locally affine Takagi-Sugeno models that use membership functions from the
fuzzy c-means (FCM) clustering algorithm with the Euclidean distance norm.


Table 4: Passing a TS model to the toolbox

Passed quantity   Meaning

Structural information
c                 Number of local models
n, m              Delays of the input and output
T                 Mapping of the regression vector onto the scheduling variable
ν                 Fuzziness parameter

Model parameters
Θ                 Overall parameter vector
Θ_LM              Local model parameters
Θ_MF              Partition parameters (prototypes v_i)


To use this part of the toolbox, information about the TS model must be passed.
First, a MATLAB struct is passed that contains the structural information:
the number of local models c, the maximum lags of the inputs and outputs n and
m, the fuzziness parameter ν and, where applicable, a function for computing
the scheduling variable from the regression vector, which is implemented here
as a linear mapping via the matrix T. Second, the overall model parameter
vector is passed. Since determining the FIM requires the analytical derivatives
of the fuzzy basis functions (FBF) with respect to the model parameters, this
must be taken into account if the FBF are replaced. Table 4 shows the
quantities passed to the process-model-based part of the toolbox.


Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020 151


3.1 Test Signal Design for Uncertainty Reduction


The test signal design for uncertainty reduction is based on minimizing scalar
measures of the Fisher information matrix (FIM), as already described, e.g., in
[5]. Assuming normally distributed noise at the output whose realizations are
mutually independent, the FIM can be written as follows:


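$$ I = \frac{1}{\sigma^2} \sum_{k=1}^{N} \frac{\partial \hat{y}(k)}{\partial \Theta} \left( \frac{\partial \hat{y}(k)}{\partial \Theta} \right)^{\top} \tag{3} $$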


Here, $\sigma^2$ is the variance of the noise, N the number of observations,
$\hat{y}(k)$ the model output in the k-th time step, and $\Theta$ the process
model parameters. In the literature, the FIM is often stated as a function of
the model parameters; for a dynamic problem, however, this is misleading,
because besides the test signal the FIM also depends on the model output
itself, since past values of the model output are part of the regression
variables. This requires an internal simulation model that can be evaluated in
every iteration step. The toolbox solves the following optimization problem:


$$ \beta_{opt} = \arg\min_{\beta} \, J\big( I( \hat{\mathbf{y}}(\Theta, \mathbf{u}(\beta)), \mathbf{u}(\beta), \Theta ) \big) \tag{4} $$


The bold type of u and y indicates that the entire signal is meant. β denotes
the signal model parameters, since individually optimizing all single values
u(k) is not practicable. Using signal models in optimal test signal design also
entails implicit constraints. The FIM-based optimization of signal model
parameters is described in [6, 7] for the Takagi-Sugeno models used. The
problem is solved with the particle swarm algorithm implemented in MATLAB [9],
which can be adapted via the corresponding arguments. Table 5 shows the design
options for FIM-based test signal design. The combination option for the test
signal type means the sum of a multisine signal and a multistep signal. Since
the model parameters of TS models can be divided into local model parameters
and partition parameters, the toolbox also allows separate optimizations in
order to design test signals that enable a precise estimate of the respective
parameter group. Figure 2 shows an example of a multistep signal designed with
the FIM-based method.
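
The dependence of the FIM on a simulated model output can be illustrated with a deliberately small example. The sketch below is not the toolbox implementation: it replaces the TS model by a hypothetical first-order linear model, approximates the output sensitivities by central finite differences instead of analytical derivatives, and uses the determinant as one possible scalar measure.

```python
def simulate_model(theta, u):
    """Hypothetical model y(k) = a * y(k-1) + b * u(k-1), simulated output."""
    a, b = theta
    y = [0.0]
    for k in range(1, len(u)):
        y.append(a * y[k - 1] + b * u[k - 1])
    return y


def fisher_information(theta, u, sigma2=1.0, eps=1e-6):
    """FIM as the scaled sum of outer products of output sensitivities."""
    n_p = len(theta)
    sens = []  # sens[i][k] approximates d y(k) / d theta_i
    for i in range(n_p):
        hi, lo = list(theta), list(theta)
        hi[i] += eps
        lo[i] -= eps
        y_hi, y_lo = simulate_model(hi, u), simulate_model(lo, u)
        sens.append([(p - q) / (2.0 * eps) for p, q in zip(y_hi, y_lo)])
    return [[sum(si * sj for si, sj in zip(sens[i], sens[j])) / sigma2
             for j in range(n_p)]
            for i in range(n_p)]


def det2(m):
    """Determinant of a 2x2 matrix, used here as a D-optimality measure."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]
```

An optimizer would vary the parameters of a signal model that generates `u` and re-simulate the model in every iteration; a signal that does not excite the system, such as a zero input, yields a zero FIM.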
Table 5: Design options for FIM-based test signal design

Setting             Options
Test signal type    Multisine, multistep, superposition
Target parameters   All, local model parameters, partition parameters
FIM measure         Determinant, trace, scaled trace,
                    maximization of the smallest eigenvalue,
                    sensitivity sum


Figure 2: Multistep signal from FIM-based test signal design
(panel title: time domain, T_max = 50 s, c_r = 1.5361; horizontal axis: time t in s).


3.2 Output Homogenization via Feedforward Control


When identifying nonlinear dynamic systems, it can be assumed that the
nonlinear behavior is mainly determined by the past values of the output. For
static systems, space-filling approaches lead to a better gridding while using
the available measurement points of the output as sparingly as possible.
Homogenizing the input disregards that this does not necessarily lead to a
homogenization of the output and thus to a better gridding of the nonlinear
behavior. For this reason, the toolbox contains methods that use a
flatness-based feedforward control based on initial models. To this end, an
output signal is first generated that covers the scheduling space uniformly. If
only the output is homogenized, the process-model-free test signal design
functions can be used. If the entire scheduling space is considered, coverage
measures must be adapted individually for it in order to obtain a suitable
reference output, which the system is to reach by means of the designed
feedforward control. An $n_z$-dimensional scheduling space is partitioned into
hypercuboids, in which the data points are counted. How the distribution in the
scheduling space affects the output depends on the individual choice of the
scheduling variable. The output is then determined iteratively with the MATLAB
PSO, which does not incur high computational costs owing to the low complexity.
For a suitable choice of the reference output, boundary conditions such as the
admissible value range and the relevant frequency band of the system must be
determined.
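
Counting data points in a grid of hypercuboids can be sketched as follows. The function names and the particular coverage measure (fraction of occupied boxes) are assumptions; the text only states that the points per hypercuboid are counted.

```python
from collections import Counter


def box_counts(points, lower, upper, n_boxes):
    """Count the points falling into each box of a uniform grid."""
    counts = Counter()
    for p in points:
        idx = []
        for v, lo, hi, n in zip(p, lower, upper, n_boxes):
            i = int((v - lo) / (hi - lo) * n)
            idx.append(min(n - 1, max(0, i)))  # clip to the outer boxes
        counts[tuple(idx)] += 1
    return counts


def coverage(points, lower, upper, n_boxes):
    """Fraction of boxes containing at least one point (assumed measure)."""
    total = 1
    for n in n_boxes:
        total *= n
    return len(box_counts(points, lower, upper, n_boxes)) / total
```

A reference signal whose points spread evenly over the grid drives this fraction toward 1, whereas clustered points leave most boxes empty.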


The method is based on designing, for a given reference signal with the desired
properties, a feedforward control by means of the TS initial model such that
the system to be identified delivers the reference output as closely as
possible. The implementation exploits the local model structure of TS models by
designing local control functions $u_j(k)$ for the local submodels. The
$u_j(k)$ are then blended with the fuzzy basis functions $\phi_j(k)$ of the TS
initial model for simulation-based evaluation and optimization.


$$ u(k) = \sum_{j=1}^{c} \phi_j(k) \cdot u_j(k) \tag{5} $$
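
The blending of local control values with fuzzy basis functions can be sketched as below. The sketch assumes a scalar scheduling variable z, the standard fuzzy c-means membership formula with fuzziness parameter `nu`, and hypothetical function names; the local control values are taken as given numbers for one time step.

```python
def fcm_memberships(z, prototypes, nu=2.0):
    """Fuzzy c-means memberships of a scalar z (they sum to one)."""
    d = [abs(z - v) for v in prototypes]
    if 0.0 in d:  # z coincides with a prototype
        return [1.0 if di == 0.0 else 0.0 for di in d]
    exponent = 2.0 / (nu - 1.0)
    return [1.0 / sum((dj / di) ** exponent for di in d) for dj in d]


def blend_controls(z, prototypes, local_controls, nu=2.0):
    """Weighted sum of the local control values for one time step."""
    phi = fcm_memberships(z, prototypes, nu)
    return sum(p * u_j for p, u_j in zip(phi, local_controls))
```

With prototypes at 0 and 1 and z = 0.25, the memberships are 0.9 and 0.1, so local controls of 2.0 and -1.0 blend to 1.7.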


For a controllable system, the design can always be carried out using a virtual
flat output. For further information on the implementation, see [8]. Figure 3
shows the effect of the design. The upper plot shows the resulting test signal.
The lower plot shows the reference signal (dotted), the response of the fuzzy
initial model (black) that was used for the feedforward design, and the actual
system response (gray). The reference signal was designed such that the
associated scheduling space is covered as uniformly as possible. Even though
the feedforward control is not exact, which is due to the superposition of the
local control functions, the signals are nevertheless very similar, so that
data exhibiting a uniform coverage of the scheduling space could be collected
from the system.


Figure 3: Test signal and outputs: reference signal (dotted), fuzzy model response
(black), system response (gray). Panel titles: optimized test signal; model and
system responses. Horizontal axis: time t in s.


4 Summary and Outlook


This contribution briefly presented the functions of a MATLAB toolbox for test
signal design. Its process-model-free part can be used for any model type and
is not restricted to Takagi-Sugeno models.


The process-model-based designs are implemented for TS models. The toolbox is
freely available. It contains demonstration examples for all categories as well
as detailed instructions for its individual elements. For output
homogenization, the effect can be read directly from the signal. The effect of
the methods was investigated in various case studies and published in
[3, 6, 8].
1 Introduction


System identification seeks to describe processes as well as possible by means
of models based on observed data. These models make it possible not only to
reproduce the behavior of a complex system but also, to a certain extent, to
predict it. This capability is particularly useful in automation and control
engineering, especially in subfields such as model-based process control and
process optimization. The key to any modeling task is choosing a model
structure suited to representing the process under investigation. For complex
processes, nonlinear dynamic approaches are used in particular. Methods for
nonlinear dynamic modeling can generally be divided into two categories. The
first approach uses external dynamics to estimate the models, where the current
model output y is estimated on the basis of delayed inputs and outputs. In
system identification, models with a NARX structure (nonlinear autoregressive
with exogenous input, nonlinear ARX) are frequently used here. Methods with
internal dynamics, which belong to the second category, have no external
feedback of delayed inputs and outputs. Here, the internal states are fed back
locally within the model. Common representatives of this category are NLSS
(general nonlinear state-space) models. The use of recurrent neural networks in
this area is also a much-discussed topic in current research. Initial
connections and similarities between these methods have already been shown, and
a certain comparability in model accuracy has been demonstrated [1], [2].


This work takes up the investigations into the comparability of the two
approaches and compares two methods for system identification with respect to
selected model characteristics. The first is a locally affine state-space
model, which represents the system behavior by means of locally linear models
estimated from recorded process data. This is done with the software package
DYLAMOT (DYnamic Local Affine MOdeling Toolbox). A NARX structure is chosen for
the locally linear models, with the parameters determined by an output error
estimation. The second is a class of recurrent neural networks, more precisely
long short-term memory (LSTM) networks. These have feedback connections between
neurons of different layers, so that process information is retained over a
period of time. Applying the methods to real processes poses several
challenges. First, processes may be operated outside their usual operating
range, which requires a certain degree of extrapolation capability from the
models used. Second, real processes are often subject to disturbances, so it
makes sense to examine the sensitivity of the models used, for example to the
influence of noise.


After a brief introduction to the two methods, a sufficient model quality must
first be achieved with both. In the next step, the models used are examined
more closely with regard to their extrapolation capability and their noise
sensitivity. The investigations are first carried out on a previously defined
test process. Subsequently, selected aspects are additionally examined on a
real nonlinear multivariable system.




2 Modeling


This contribution aims to represent, by data-based approaches, dynamic
processes that can be described by nonlinear differential equations of the form
$\dot{x}(t) = f(t, x(t), u(t))$ and $y(t) = h(x(t), u(t))$. In doing so, the
functional relationships between f and h are to be investigated. Locally affine
state-space models and deep recurrent neural networks are used to model the
nonlinear dynamic systems. Both methods are briefly introduced below.


2.1 Tiefe rekurrente neuronale Netze 


In einem Feedforward-Netzwerk werden die Informationen auf einem direkten 
Weg durch das Netz geleitet, d.h. von der Eingabeschicht durch die einzel- 
nen verdeckten Netzwerkschichten (hidden layers) zur Ausgabeschicht. Dabei 
berücksichtigen die Neuronen (hidden units) in den jeweiligen Schichten aus- 
schließlich den aktuellen Netzeingang. Ein klassisches einschichtiges Feedfor- 
ward-Netzwerk stellt eine Funktionsabbildung von einem Eingang x zu einem 
Ausgang ¥ (skalare Größen) dar und wird für die Netzwerkschicht j wie folgt 
beschrieben: 

$$\hat{y} = \sum_{j=1}^{n} \gamma_j \, \sigma\bigl(\alpha_j (x + \beta_j)\bigr). \tag{1}$$

Here, the parameters $\alpha_j$ and $\beta_j$ are quantities adapted to the
dimension of x. The variables $\alpha$, $\beta$ and $\gamma$ are the model
parameters of the network, with $\theta = \{\alpha_j, \beta_j, \gamma_j\}$. A
sigmoid function $\sigma$ is used as the activation function [3]:

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{2}$$

The expression $\sigma(\alpha_j (x + \beta_j))$ describes the hidden units,
which together form the hidden layer [1].


Feedforward networks can only carry information from one time step to the
next. To retain information over a longer period, recurrent neural networks
(RNNs) are a suitable choice. An efficient type of RNN is the long short-term
memory (LSTM) network, which, owing to its internal structure, can also store
information over a longer period [4]. A complete LSTM network consists of one
or more chained LSTM layers and a final fully connected linear layer. The
ability to store information over a longer period is supported by two kinds of
internal states and by the possibility of controlling the flow of information
into and out of these states. The internal states of an LSTM layer at time t
are the hidden state $h_t$ and the cell state $c_t$, which is responsible for
the long-term memory of the network. The hidden state and the cell state
represent both the output of an LSTM layer at time t - 1 and the input of the
following LSTM layer at time t. Based on the inputs $x_t$ and the hidden state
$h_{t-1}$ from the previous network layer, the flow of information into and out
of the cell state is controlled by so-called gates: a forget gate $f_t$, an
input gate $i_t$, a cell gate $g_t$ and an output gate $o_t$. The forget gate
$f_t$ decides which information is removed from the cell state. For this
purpose, a sigmoid function $\sigma$ (Equation (2)) is used, whose output lies
between 0 (forget completely) and 1 (retain completely). The input gate $i_t$
decides, likewise based on a sigmoid function, which values are added to the
cell state. To this end, the cell gate $g_t$ creates a list of candidate values
using a tanh function. The combination of input gate and cell gate leads to an
update of the cell state $c_t$. Finally, the output gate $o_t$ specifies, based
on a sigmoid function, which information from the current cell state $c_t$
flows into the output of the network layer.


The gates are defined as follows:

$$f_t = \sigma(W_{xf} x_t + b_{xf} + W_{hf} h_{t-1} + b_{hf}) \tag{3}$$

$$i_t = \sigma(W_{xi} x_t + b_{xi} + W_{hi} h_{t-1} + b_{hi}) \tag{4}$$

$$g_t = \tanh(W_{xg} x_t + b_{xg} + W_{hg} h_{t-1} + b_{hg}) \tag{5}$$

$$o_t = \sigma(W_{xo} x_t + b_{xo} + W_{ho} h_{t-1} + b_{ho}). \tag{6}$$

Here, $W_{xf}$, $W_{xi}$, $W_{xg}$ and $W_{xo}$ are the weight matrices of the
input $x_t$, and $W_{hf}$, $W_{hi}$, $W_{hg}$ and $W_{ho}$ are the weight
matrices of the previous hidden state for the respective gate. Furthermore, the
vectors $b_{xf}$, $b_{xi}$, $b_{xg}$ and $b_{xo}$ are the biases at the input,
and $b_{hf}$, $b_{hi}$, $b_{hg}$ and $b_{ho}$ are the biases in the hidden
state of the respective gate. The weight matrices and bias vectors contain the
parameters of the LSTM layers. The gates are used to compute the cell state
$c_t$ and the current hidden state $h_t$ as follows:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{7}$$

$$h_t = o_t \odot \tanh(c_t). \tag{8}$$

To update the cell state $c_t$, the Hadamard products ($\odot$) of the forget
gate $f_t$ and the previous cell state $c_{t-1}$ and of the input gate $i_t$
and the cell gate $g_t$ are added. The hidden state $h_t$ can then be computed
from $o_t$ and $c_t$. Finally, the output of the last LSTM layer forms the
input of a fully connected linear layer, which produces the final network
output $y_t$. An overview of the structure of LSTM layers and their chaining
can be found in [5].
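

The gate equations (3) to (6) and the state updates (7) and (8) can be written
down almost literally in code. The following is a minimal NumPy sketch of a
single LSTM time step, not the MATLAB implementation used in this work. The
function names, the dictionary layout of the parameters and the random
initialization are assumptions made for illustration only.

```python
import numpy as np

GATES = ("f", "i", "g", "o")  # forget, input, cell and output gate


def sigmoid(x):
    """Logistic sigmoid, Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_in, n_hidden, rng, scale=0.1):
    """Random weights and zero biases for all four gates (illustrative only)."""
    return {
        "W_x": {g: scale * rng.standard_normal((n_hidden, n_in)) for g in GATES},
        "W_h": {g: scale * rng.standard_normal((n_hidden, n_hidden)) for g in GATES},
        "b_x": {g: np.zeros(n_hidden) for g in GATES},
        "b_h": {g: np.zeros(n_hidden) for g in GATES},
    }


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of an LSTM layer following Eqs. (3)-(8)."""
    # Pre-activation W_x x_t + b_x + W_h h_{t-1} + b_h for each gate
    pre = {
        g: p["W_x"][g] @ x_t + p["b_x"][g] + p["W_h"][g] @ h_prev + p["b_h"][g]
        for g in GATES
    }
    f_t = sigmoid(pre["f"])          # Eq. (3)
    i_t = sigmoid(pre["i"])          # Eq. (4)
    g_t = np.tanh(pre["g"])          # Eq. (5)
    o_t = sigmoid(pre["o"])          # Eq. (6)
    c_t = f_t * c_prev + i_t * g_t   # Eq. (7), element-wise (Hadamard) products
    h_t = o_t * np.tanh(c_t)         # Eq. (8)
    return h_t, c_t


# Run the cell over a short random input sequence with two inputs
rng = np.random.default_rng(0)
params = init_params(n_in=2, n_hidden=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.uniform(0.1, 0.9, size=(10, 2)):
    h, c = lstm_step(x_t, h, c, params)
print(h)
```

Because $h_t$ is the product of a sigmoid output and a tanh output, every
component of the hidden state stays strictly inside (-1, 1), whatever the
input amplitude.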


Another fact worth highlighting in the context of LSTM networks is their
structural similarity to the nonlinear state-space (NLSS) models established in
system identification. Under certain assumptions, each structure can be
converted into the other. Detailed investigations of the comparability of NLSS
models and LSTM networks can be found in [1] and [2].


2.2 LOLIMOT 


Another option for data-based modeling is the LOLIMOT (LOcal LInear MOdel Tree)
algorithm. It belongs to the incremental heuristic construction algorithms and
represents nonlinear processes by superimposing several local affine submodels,
which offers the advantage of easy interpretability [9].


The model output of a local affine model is defined as follows:

$$\hat{y} = \sum_{i=1}^{M} (w_{i0} + w_{i1} x_1 + \dots + w_{in} x_n) \, \Phi_i(z). \tag{9}$$

The model output with M affine submodels is the sum of the individual
submodels, each weighted with an activation function $\Phi_i(z)$. Each submodel
i is determined by n inputs $x = [x_1 \; x_2 \; \dots \; x_n]^T$ and n+1
parameters $w_i = [w_{i0} \; w_{i1} \; \dots \; w_{in}]^T$. The activation
depends on the input vector of the activation functions
$z = [z_1 \; z_2 \; \dots \; z_{n_z}]^T$, which corresponds to the operating
point of the local affine model. Like x, z is chosen from all available inputs
[7]. The validity region of the submodels is described by a normalized
activation function, which weights the output of each submodel depending on the
z-regressor and is defined as follows:

$$\Phi_i(z) = \frac{\mu_i(z)}{\sum_{j=1}^{M} \mu_j(z)}. \tag{10}$$

Here, $\mu_i(z)$ reflects the degree of membership in the validity regions. For
a meaningful interpretation, the activation functions must sum to one:
$\sum_{i=1}^{M} \Phi_i(z) = 1$. To obtain an easily interpretable model
structure and a smooth transition between the submodels, Gaussians are used as
membership functions,

$$\mu_i(z) = \exp\left( -\frac{1}{2} \sum_{j=1}^{n_z} \frac{(z_j - c_{ij})^2}{\sigma_{ij}^2} \right), \tag{11}$$

where $c_{ij}$ is the center and $\sigma_{ij}$ the standard deviation of the
Gaussian of the i-th submodel in the j-th dimension of the z-regressor. The
quantities $c_{ij}$ and $\sigma_{ij}$ result from the position and size of the
validity regions of the submodels in the input space [7]. LOLIMOT is
characterized by an axis-orthogonal partitioning of the input space, which
enables simple, fast and, above all, transparent structure optimization. In
each step, the locally worst submodel is split into two new submodels by an
axis-orthogonal split through its center.


Figure 1: Equation-error arrangement (NARX) for the parameter estimation in a)
and output-error arrangement for the structure optimization in b).


The locally worst submodel can be identified, for example, by means of the
local loss function

$$V_{i,\mathrm{local}} = \sum_{k=1}^{N} \bigl(y(k) - \hat{y}(k)\bigr)^2 \, \Phi_i(z(k)), \tag{12}$$

which sums the squared model error weighted with the respective activation
function. Of all possible splits, the one that leads to the smallest global
error is selected. A major advantage of the structure described in Equation (9)
becomes apparent in the parameter estimation. Since the model output is linear
in its parameters, linear methods such as weighted least squares can be used.
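

The interplay of normalized Gaussian validity functions, affine submodels and
weighted least squares can be illustrated with a small static example. The
sketch below is not the LOLIMOT or DYLAMOT implementation: it fixes a partition
with two submodels by hand instead of running the tree construction, and the
function names, the target function $y = x^2$ and the choice of the standard
deviation as one third of the region width are assumptions for illustration.

```python
import numpy as np


def validity_functions(Z, centers, sigmas):
    """Normalized Gaussian validity functions; each row sums to one."""
    # Z: (N, nz), centers and sigmas: (M, nz)
    d = (Z[:, None, :] - centers[None, :, :]) / sigmas[None, :, :]
    mu = np.exp(-0.5 * np.sum(d ** 2, axis=2))   # membership, shape (N, M)
    return mu / mu.sum(axis=1, keepdims=True)


def augment(X):
    """Prepend a column of ones so that the offset w_i0 is estimated too."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_local_models(X, y, Phi):
    """Weighted least squares, one independent estimate per submodel."""
    Xa = augment(X)
    W = np.empty((Phi.shape[1], Xa.shape[1]))
    for i in range(Phi.shape[1]):
        sw = np.sqrt(Phi[:, i])                  # square root of the weights
        W[i] = np.linalg.lstsq(Xa * sw[:, None], y * sw, rcond=None)[0]
    return W


def predict(X, Phi, W):
    """Weighted sum of the affine submodel outputs."""
    return np.sum((augment(X) @ W.T) * Phi, axis=1)


# Static example: approximate y = x^2 on [0, 1] with two local affine models
x = np.linspace(0.0, 1.0, 201)
X = x[:, None]
Z = X                                  # operating point equals the input here
y = x ** 2
centers = np.array([[0.25], [0.75]])   # centers of the two validity regions
sigmas = np.full((2, 1), 0.5 / 3.0)    # assumption: one third of the width
Phi = validity_functions(Z, centers, sigmas)
W = fit_local_models(X, y, Phi)
y_hat = predict(X, Phi, W)

V_local = Phi.T @ (y - y_hat) ** 2     # local loss of each submodel
rmse_lmn = float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Reference: a single global affine model
w_lin = np.linalg.lstsq(augment(X), y, rcond=None)[0]
rmse_lin = float(np.sqrt(np.mean((y - augment(X) @ w_lin) ** 2)))
print(V_local, rmse_lmn, rmse_lin)
```

In a construction algorithm of the kind described above, the submodel with the
largest entry in `V_local` would be the candidate for the next split.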


In the following, LOLIMOT is used to model a nonlinear dynamic system, so the
model structure considered so far must be extended by dynamics. For this
purpose, a NARX structure is used for the parameter estimation, in which a
one-step prediction error based on an equation-error arrangement (Figure 1a) is
minimized. Here, B(q) is the numerator polynomial and A(q) the denominator
polynomial of the transfer function of the model, where q denotes the
time-shift operator with $q^{-1} x(k) = x(k-1)$. Furthermore, d(k) represents a
possible disturbance [7]. The model output of the dynamic local affine overall
model can be written as follows:

$$\hat{y}(k) = \sum_{i=1}^{M} \bigl( b_{i1} u(k-1) + \dots + b_{in} u(k-n) - a_{i1} y(k-1) - \dots - a_{in} y(k-n) + \xi_i \bigr) \, \Phi_i(z(k)). \tag{13}$$

A SISO (single-input single-output) system of order n is assumed as an example.
Delayed inputs with the parameters $b_i$, delayed outputs with the parameters
$a_i$ and an offset $\xi$ are used as regressors. Minimizing the equation error
in the parameter estimation has the advantage that linear estimation methods
such as least squares remain applicable for dynamic systems as well, since
linearity in the parameters is preserved. The structure optimization, in
contrast, is based on the output error (NOE, nonlinear output error,
structure), see Figure 1b. This has the advantage that the error propagation
caused by the feedback, which results from optimizing the parameters in the
equation-error arrangement, can be avoided in the structure optimization [8].
Equation (13) can also be written in a parameter-varying form:

$$\hat{y}(k) = \sum_{i=1}^{n} b_i(z(k)) \, u(k-i) - \sum_{i=1}^{n} a_i(z(k)) \, \hat{y}(k-i) + \xi(z(k)). \tag{14}$$
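

The difference between the equation-error (NARX) and the output-error (NOE)
arrangement can be made concrete with a first-order example. The sketch below
uses a single linear submodel with hypothetical parameter values that do not
come from this work. It shows that a small parameter error barely affects the
one-step prediction, which uses the measured output, but accumulates through
the feedback when the model output is fed back in simulation.

```python
import numpy as np


def one_step_prediction(u, y, a, b):
    """Equation-error (NARX) arrangement: uses the measured output y(k-1)."""
    y_hat = np.zeros_like(y)
    y_hat[1:] = b * u[:-1] - a * y[:-1]
    return y_hat


def simulation(u, a, b, y0=0.0):
    """Output-error (NOE) arrangement: feeds back the model output y_hat(k-1)."""
    y_hat = np.zeros(len(u))
    y_hat[0] = y0
    for k in range(1, len(u)):
        y_hat[k] = b * u[k - 1] - a * y_hat[k - 1]
    return y_hat


def rms(e):
    return float(np.sqrt(np.mean(e ** 2)))


# Hypothetical true process: y(k) = 0.9 y(k-1) + 0.1 u(k-1), i.e. a = -0.9
a_true, b_true = -0.9, 0.1
u = np.ones(200)                        # unit step as excitation
y = simulation(u, a_true, b_true)       # noise-free "measurement"

a_hat = -0.85                           # slightly wrong parameter estimate
e_pred = y - one_step_prediction(u, y, a_hat, b_true)
e_sim = y - simulation(u, a_hat, b_true)
print(rms(e_pred), rms(e_sim))
```

In this example the one-step error settles at about 0.05, whereas the
simulation error settles at about 0.33, because the simulated model converges
to its own steady-state gain.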


The software package DYLAMOT (DYnamic Local Affine MOdeling Toolbox) contains
extensions of the LOLIMOT method for identifying nonlinear dynamic processes
with local affine models, which allow a better model quality to be achieved.
DYLAMOT uses the heuristic LOLIMOT construction algorithm to partition the
input space. One of the implemented extensions is the delay of the z-regressor,
which describes the operating point, in accordance with the respective model
inputs. The model parameters then do not change simultaneously as a function of
z(k) at time k, but each with the delay i of the associated input $u(k-i)$ or
$\hat{y}(k-i)$ [7]. A major advantage of delaying the z-regressors is a
considerably improved input/output behavior compared with a model with
simultaneous parameter change. While the delayed parameter change improves the
input/output behavior, the offset is additionally given independent numerator
dynamics, which in particular increases the interpretability of the models.
With this extension, offset parameters with delayed parameter change are
estimated for each submodel. The model output with delayed parameter change and
offset numerator dynamics then becomes:

$$\hat{y}(k) = \sum_{i=1}^{n} b_i(z(k-i)) \, u(k-i) - \sum_{i=1}^{n} a_i(z(k-i)) \, \hat{y}(k-i) + \sum_{i=1}^{n} \xi_i(z(k-i)). \tag{15}$$

With this extension, the offset in the model depends on different points in
time, which increases the flexibility of the model. The delayed parameter
change is mandatory for the use of several offset parameters, since otherwise
all offset parameters could be merged into one [7]. As a consequence of the
extensions of the model structure, each submodel is estimated in the
output-error arrangement (Figure 1b). The parameters therefore enter the output
error nonlinearly, so that nonlinear optimization methods must be used to
estimate them [9].


The extensions of the model structure make it possible to obtain a minimal
state-space realization that can be written in observer canonical form [7].
This allows a comparison of the presented model structures according to [2].


3 Application example - test process


The comparison of the presented methods focuses on the achievable model
quality. In addition, aspects such as the extrapolation capability and the
noise sensitivity of the models are examined. The comparison is first carried
out on a selected test process, using the software package DYLAMOT on the one
hand and the Deep Learning Toolbox in MATLAB [10] on the other.


Figure 2: Block diagram of the test process used.


3.1 System description - test process


The structure of the test process is shown in Figure 2. The test process is a
nonlinear dynamic MISO (multiple input, single output) system based on a
Hammerstein model and a Wiener model. In addition, white noise d is added to
the output $\hat{y}$. APRB (amplitude-modulated pseudo-random binary) signals
are used to excite the process and are applied to both inputs $u_1$ and $u_2$.
To achieve a high model quality, the decisive frequency and amplitude ranges of
the system must be excited sufficiently. For this purpose, APRB signals with a
length of 2046 seconds and a clock time of 2 seconds are chosen. The amplitudes
are drawn from a uniform distribution. The static nonlinearities


1 1 u 
f(u) ne) (16) 
$$g(\tilde{u}_2) = \arctan(\tilde{u}_2) + \tilde{u}_2 \tag{17}$$

are considered for $\{(u_1, u_2) \in \mathbb{R}^2 \mid 0.1 \le u_j \le 0.9 \;\; \forall j \in \{1, 2\}\}$.
The transfer functions $G_1(s)$ and $G_2(s)$ are defined as follows:

$$G_1(s) = \frac{1}{5s^3 + 66s^2 + 73s + 12} \tag{18}$$

$$G_2(s) = \frac{1}{s^3 + 2s^2 + 6s + 5}. \tag{19}$$

For the investigations on the test process, N = 204600 data points are
recorded, of which 70 % are used for training the models and 30 % as the test
data set. To assess the respective model quality, the deviation between the
measured value y and the predicted value $\hat{y}$ is determined:
$e = y - \hat{y}$. In addition, the RMSE is computed on the test data:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_T} \sum_{k=1}^{N_T} \bigl(y(k) - \hat{y}(k)\bigr)^2}, \tag{20}$$

where $N_T$ denotes the number of test data points. A suitable choice of the
various hyperparameters must be found for the investigations. To obtain a
sufficient model quality, the number of local linear models in DYLAMOT is
varied between 4 and 8. Furthermore, the order of the local models is varied
between 2 and 4. For the LSTM network, the number of neurons, the number of
hidden layers and the number of training epochs are varied such that a
comparable model quality results.
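

The excitation, the data split and the quality measure described above can be
sketched in a few lines. The following example is only an approximation of the
actual setup: it replaces the shift-register PRBS by random bits, and it
assumes a sampling time of 0.01 s because this reproduces the stated
N = 204600 samples for a signal length of 2046 s. All function names are
illustrative.

```python
import numpy as np


def aprbs(n_clock, samples_per_clock, low, high, rng):
    """Simplified APRB-like signal.

    A random bit sequence (assumption: random bits instead of a shift-register
    PRBS) defines the switching instants; every constant run receives a new
    amplitude drawn from a uniform distribution on [low, high].
    """
    bits = rng.integers(0, 2, size=n_clock)
    levels = np.empty(n_clock)
    amplitude = rng.uniform(low, high)
    levels[0] = amplitude
    for k in range(1, n_clock):
        if bits[k] != bits[k - 1]:       # bit change: draw a new amplitude
            amplitude = rng.uniform(low, high)
        levels[k] = amplitude
    return np.repeat(levels, samples_per_clock)


def rmse(y, y_hat):
    """Root mean squared error, Eq. (20)."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


rng = np.random.default_rng(0)
# 2046 s signal length / 2 s clock time = 1023 clock intervals;
# 2 s clock time / 0.01 s sampling time (assumed) = 200 samples per interval
u1 = aprbs(1023, 200, 0.1, 0.9, rng)
u2 = aprbs(1023, 200, 0.1, 0.9, rng)

# 70 % of the samples for training, the remaining 30 % for testing
n_train = (7 * u1.size) // 10
u1_train, u1_test = u1[:n_train], u1[n_train:]
print(u1.size, n_train, u1_test.size)
```

The split keeps the temporal order of the samples, which matters for dynamic
models, since shuffling would destroy the dependence on past inputs and
outputs.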


3.2 Model comparison - test process


Both methods achieve a high model quality in the investigations. In DYLAMOT,
the best result is obtained with a local linear model consisting of 7 submodels
of fourth order. For the LSTM network, the best result is obtained with one
LSTM layer with 128 neurons. Training uses a maximum of 500 epochs and the ADAM
(adaptive moment estimation) algorithm [11] with a constant learning rate of
0.01. Figure 3a shows a section of the APRB signals used to excite the two
inputs $u_1$ and $u_2$. For comparability with the heating section, the values
are given as percentages. Figures 3b and c show the measured and estimated
values (y and $\hat{y}$) of the output for both models, based on a separate
test data set. Also shown is the error e, i.e., the deviation between measured
and estimated value (Figures 3d and e). The high model quality becomes apparent
from the comparison between the measured and estimated values in Figures 3b and
c. Neither method shows large deviations in the trajectories. DYLAMOT achieves
a slightly better model quality than the LSTM network. This is also reflected
in the RMSE, which over the entire test data set is 0.132 for DYLAMOT and 0.298
for the LSTM network.


Figure 3: Comparison of the models on the test process: a) inputs used; b) test
process output and DYLAMOT model output; c) test process output and LSTM model
output; d) error e of the DYLAMOT model output and e) error e of the LSTM model
output.


To investigate the extrapolation capability, the amplitudes of the APRB signals
are reduced during model building. The models are built with APRB signals whose
amplitudes are taken from the interval [0.3, 0.7]. These models are then
evaluated on test data generated by an excitation with amplitudes from the
interval [0.1, 0.9]. The trained models are thus tested on data whose minimum
and maximum values fall below or exceed the minimum and maximum model outputs
seen in training. Figures 4a and b show the inputs for both methods together
with the limits of the trained range. In Figures 4c and d, the models are
applied to the amplitude-modified test data, which is evident from the model
outputs exceeding the minimum and maximum values from training. Comparing the
input data with the respective process and model outputs reveals the
extrapolation regions. The comparison of the trajectories in Figures 4c and d
shows that in these regions the deviations between measured and estimated
values are considerably smaller with DYLAMOT than with the LSTM network. This
result is also reflected in the RMSE, which in this case is 0.7554 for DYLAMOT
and 1.5957 for the LSTM network. One reason for this result lies in the
structure of LSTM networks. The individual LSTM layers are based on sigmoid and
tanh functions, which are responsible for updating the cell states. According
to their shape, these functions have a lower and an upper saturation region in
which their slope tends to zero. These regions are reached when input data
outside the trained model range are used. These restrictions do not exist with
DYLAMOT. DYLAMOT exhibits linear extrapolation behavior, so that no lower or
upper saturation regions can be reached.
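

The effect of saturation on extrapolation can be reproduced with a deliberately
simple static example. The sketch below does not use the models of this work:
the linear target function and the hand-set parameters of the tanh model are
hypothetical and only chosen such that both models fit well inside the training
range [0.3, 0.7].

```python
import numpy as np


def target(x):
    """Hypothetical process characteristic: a straight line."""
    return 2.0 * x


def tanh_model(x):
    """Saturating model with hand-set, hypothetical parameters.

    It matches value and slope of the target at x = 0.5, but its output is
    bounded to the interval (0, 2) by the tanh function.
    """
    return 1.0 + np.tanh(2.0 * (x - 0.5))


x_train = np.linspace(0.3, 0.7, 41)     # trained amplitude range
x_extra = np.array([0.1, 0.9])          # limits of the extended test range

# Affine model estimated by least squares on the training range only
coef = np.polyfit(x_train, target(x_train), deg=1)


def affine_model(x):
    return np.polyval(coef, x)


err_train_tanh = float(np.max(np.abs(target(x_train) - tanh_model(x_train))))
err_extra_tanh = float(np.max(np.abs(target(x_extra) - tanh_model(x_extra))))
err_extra_affine = float(np.max(np.abs(target(x_extra) - affine_model(x_extra))))
print(err_train_tanh, err_extra_tanh, err_extra_affine)
```

Inside the training range the tanh model deviates by about 0.02, but at the
limits of the extended range the deviation grows to about 0.14, while the
affine model continues its straight line. For a nonlinear process the affine
extrapolation is of course not exact either, it merely avoids the saturation.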


Figure 4: Extrapolation behavior on the test process: a), b) inputs used;
c) DYLAMOT model output when generalizing to the amplitude-modified test data
set; d) LSTM model output when generalizing to the amplitude-modified test data
set.


A further point of investigation in this contribution is the sensitivity of the
models to possible noise. For this investigation, the models already built are
used and subjected to noise. To this end, the test process is excited with
white noise. The process output y is then used to evaluate the models. For the
excitation, white noise with a sampling time of 0.01 seconds is applied to both
inputs. In addition, a power spectral density of 8 W/Hz is chosen for $u_1$ and
a value of 4 W/Hz for $u_2$. Figure 5 shows the measured and estimated values
(y and $\hat{y}$) of the output for both methods. Comparing the plots clearly
shows that, for the chosen model and noise parameters, the DYLAMOT model
(Figure 5a) reacts more strongly to the noisy input than the LSTM network
(Figure 5b).


Figure 5: Sensitivity of the models to white noise: a) DYLAMOT; b) LSTM.


Table 1: Cutoff frequencies of the test process and of the models.

Output                 Cutoff frequency in rad/s
Test process (Bode)    0.5000
Test process (FFT)     0.4947
DYLAMOT (FFT)          0.7727
LSTM (FFT)             0.5283


This is also reflected in the cutoff frequencies of the process output and of
the respective model outputs determined by FFT (fast Fourier transform), which
are listed in Table 1. As a check, the cutoff frequency of the test process is
also determined from the Bode diagram. As expected, the cutoff frequencies
obtained from the Bode diagram and from the FFT are nearly identical. DYLAMOT
shows the largest cutoff frequency in the investigation, which matches the
results in Figure 5. In the model built with DYLAMOT, higher frequencies are
attenuated considerably less than by the actual process and by the LSTM model.
One explanation is the model order that was chosen for DYLAMOT to obtain an
acceptable and comparable model quality for the test process. A higher model
order leads to a better representation of the process on the test data, but
makes the model more susceptible to disturbances.
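

The Bode-based check of a cutoff frequency can be sketched as follows. Since
the complete structure of the test process is only given in Figure 2, the
example uses a hypothetical first-order low-pass $G(s) = 1/(2s + 1)$, whose
cutoff frequency of 0.5 rad/s is known analytically. The function name and the
frequency grid are assumptions for illustration.

```python
import numpy as np


def cutoff_frequency(num, den, w):
    """First grid frequency at which |G(jw)| drops 3 dB below its first value.

    num, den: polynomial coefficients in descending powers of s.
    w: increasing frequency grid in rad/s; w[0] should be small enough that
    |G(j w[0])| approximates the steady-state gain.
    """
    jw = 1j * w
    mag = np.abs(np.polyval(num, jw) / np.polyval(den, jw))
    below = np.flatnonzero(mag < mag[0] / np.sqrt(2.0))
    return float(w[below[0]]) if below.size else None


w = np.logspace(-3, 2, 5001)                    # 0.001 ... 100 rad/s
w_c = cutoff_frequency([1.0], [2.0, 1.0], w)    # hypothetical G(s) = 1/(2s+1)
print(w_c)
```

The result is limited by the resolution of the frequency grid. An FFT-based
estimate from measured input and output data, as used for Table 1, is
additionally affected by noise and by the finite record length, which explains
small differences between the two values.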


4 Application example - heating section


While the investigations in the previous section were carried out on a selected
test process, a real nonlinear dynamic process is now considered. As before,
the respective model quality is compared and the extrapolation capability is
examined. The noise sensitivity is not investigated on the real process.


4.1 System description - heating section


A heating section from the company "ELWE Technik" is used for the
investigations. It is a laboratory model that can be controlled via an xPC
Target system under MATLAB/Simulink. Besides a heater and a fan, the heating
section contains sensors for measuring the air temperature and the air mass
flow. In addition, there is a manually adjustable throttle valve at the air
inlet. With two inputs, two outputs and one disturbance, the system can thus be
regarded as a multivariable system. The inputs of the system are the fan power
$P_L$ and the heating power $P_H$. The output considered in the investigation
is the measured temperature T. APRB signals are again used to excite the
process and are applied to both inputs $u_1$ and $u_2$. The amplitudes are
taken from the interval [10, 90] % of the respective maximum power. All other
excitation settings match those used for the test process and can be found in
the previous section.


Figure 6: Comparison of the models on the heating section: a) DYLAMOT; b) LSTM.


4.2 Model comparison - heating section


Both methods achieve a comparably high model quality. The models are built with
the same hyperparameters that were chosen for the investigations on the test
process. The model quality becomes apparent from the comparison between the
measured and estimated values (Figures 6a and b). Although small deviations
occur at some points of the trajectories, the model quality can be considered
sufficiently high over the entire trajectory. DYLAMOT achieves a slightly
better result than the LSTM network. This is also reflected in the RMSE, which
over the entire test data set is 0.3884 for DYLAMOT and 0.5121 for the LSTM
network.


To investigate the extrapolation capability, the amplitudes of the APRB signals
used for excitation are reduced during model building, as for the test process.
The models are built with APRB signals whose amplitudes are taken from the
interval [30, 70] % of the respective maximum power. These models are then
evaluated on test data generated by an excitation with amplitudes from the
interval [10, 90] %. Figures 7a and b show the inputs for both methods together
with the limits of the trained range. In Figures 7c and d, as for the test
process, the models are applied to the amplitude-modified test data, which is
evident from the model outputs exceeding the minimum and maximum values from
training. Comparing the input data with the respective process and model
outputs reveals the extrapolation regions. The results from the investigations
on the test process are confirmed. On the heating section, too, DYLAMOT
achieves a better model quality than the LSTM network. The comparison of the
trajectories in Figures 7c and d shows that with DYLAMOT the deviations between
measured and estimated values below and above the trained range are
considerably smaller than with the LSTM network. This result is also reflected
in the RMSE, which in this case is 0.9848 for DYLAMOT and 1.8220 for the LSTM
network.


5 Summary and outlook


In this contribution, deep recurrent neural networks (LSTM networks) and local
affine state-space models (DYLAMOT) are used to represent nonlinear dynamic
systems. The model quality achieved with each method is compared. In addition,
the methods are compared with regard to extrapolation capability and noise
sensitivity. It is shown that, under the chosen conditions, a high model
quality can be achieved in both applications, but that DYLAMOT achieves a
slightly better result on the test process as well as on the heating section.
Furthermore, it is shown that with DYLAMOT the deviations between measured and
estimated values outside the trained ranges are considerably smaller than with
the LSTM network. It is also shown that, under these conditions, the DYLAMOT
model reacts more strongly to the noisy input than the LSTM network. As a
further step in the comparison of the methods, the memory requirements of the
models could be examined. This is particularly important when the models are
used in industrial environments, where computing and memory resources are often
limited. Another aspect to be investigated is the achievable model quality with
different excitation signals. APRB signals often cannot be used in practice
because of their high dynamics. Continuous excitation signals are sought that
are usable in practice and also cover the input space well enough to achieve a
sufficient model quality.


Abstract


In this paper, a new representation of the nonlinear dynamics of the tower crane 
system in Takagi-Sugeno (TS) fuzzy form is proposed and used for observer 
design. The TS fuzzy nonlinear observer is utilized to estimate unmeasurable 
states with guaranteed global asymptotic stability. The stability analysis is for- 
mulated as linear matrix inequalities (LMIs). The TS fuzzy model is equivalent 
to a reduced-order nonlinear model of the tower crane system with a varying 
cable length. For verification, simulation results of the reduced-order model, 
TS nonlinear fuzzy model and the estimated observer states are compared to 
the results of a tower crane on a laboratory scale. 


1 Introduction 


Cranes are widely used for heavy load transportation and are typically
classified into (1) tower cranes, (2) rotary cranes and (3) overhead cranes.
Due to their wide range of applications on construction sites, tower cranes are
the subject of investigations in automation and control engineering. It must
be taken into account that this type of crane system has complicated,
nonlinear, under-actuated dynamics. Therefore, controlling tower crane systems
is a challenge.


Various crane control techniques have been proposed to achieve precise
positioning and oscillation suppression of the payloads. Model-based fuzzy
control has been recognized as an alternative approach to conventional
techniques for overhead cranes. Adaptive fuzzy sliding-mode control was
designed to guarantee asymptotic stability for payload oscillations [1]. A
discrete-time TS fuzzy observer and controller were designed in [3]. In
addition, the Mamdani-type fuzzy approach was used in [2] to design an active
anti-swing controller.


Few of these techniques have been extended to tower cranes [4]. Mainly
conventional methods have been used, such as command shaping for oscillation
reduction [5] and an optimal iterative method [6]. Only a fuzzy anti-swing
controller for tower cranes, based on the Mamdani type and taking friction and
time delay into account, was proposed in [7].


Regarding robust controllers, sliding mode control based on the nonlinear model
is proposed in [8] and extended to an adaptive scheme in [9]. Optimal control
has been proposed with the path-following method [10]. Parametric uncertainties
are handled using adaptive control in [11], adaptive backstepping for a 2D
system is proposed in [12], and adaptive nonlinear integral sliding mode
control was investigated in [13].


The existing methods are developed based on simplified control-oriented models
using approximations and assumptions of the original tower crane dynamics. The
difference between the original dynamics and the simplified model affects the
controllers' performance and might lead to instability [11]. On the other hand,
considering the full nonlinear model increases the complexity of the control
design and the closed-loop stability assessment. This complexity can be
eliminated by deriving a suitable reduced-order system. The main aim of this
paper is to derive a suitable Takagi-Sugeno (TS) fuzzy model equivalent to a
reduced-order nonlinear model that accurately reflects the system's dynamics.
The Takagi-Sugeno framework is chosen because it offers an exact representation
of the original nonlinear model and facilitates the stability analysis and
observer design of nonlinear systems [14].


Moreover, the estimation convergence of a state observer depends on the
mathematical model [3]. Therefore, a TS fuzzy observer is designed based on the
aforementioned model. The observer is used to estimate the unmeasurable system
states found in practical applications, such as the velocities that are
required by the controller. In previous closed-loop controllers, the feedback
relied on differentiating the measured positions [10] and filtering the results
[11], which causes a time delay and reduces the accuracy of the feedback.


This paper is organized as follows: Section 2 presents the nonlinear
reduced-order model of the tower crane and its state space representation. The
equivalent TS fuzzy model is presented in Section 3. In Section 4, the TS fuzzy
observer is designed and its stability is analyzed. Section 5 compares the
models and the estimated observer states with the experimental results.
Section 6 presents the conclusion.


2 Continuous-time Nonlinear Dynamical Model 


The model is derived with the Euler-Lagrange method [15], which results in
highly nonlinear, under-actuated MIMO equations:

$$\frac{d}{dt}\frac{\partial L}{\partial \dot{q}_i} - \frac{\partial L}{\partial q_i} = F_{q_i}, \qquad i = 1, \dots, 5, \tag{1}$$

where $q_i$ is the vector of the tower crane's five degrees of freedom shown in
Figure 1. The vector $q_i = [x_t, \theta, \alpha, \beta, l]^T$ contains the trolley
position, the jib rotation, the $\alpha$ and $\beta$ oscillation angles and the
cable length, respectively. The joint torque vector is given by
$F_{q_i} = [F_x, F_\theta, F_l, 0, 0]^T$, where $F_x$ denotes the trolley driving
force, $F_\theta$ the tower rotating torque and $F_l$ the cable driving force.


2.1 Continuous-time Nonlinear Model 


Figure 1: Schematic representation of the tower crane (labelled parts: jib, tower, payload)

The equations can be reformulated by dividing the tower crane's DOF into (1)
actuated states $q_1$ and (2) un-actuated states $q_2$. The actuated states are
the trolley position, the tower rotation and the cable length, denoted by
$q_1 = [x_t \;\, \theta \;\, l]^T$. The un-actuated states are the swing angles of
the payload, $q_2 = [\alpha \;\, \beta]^T$. Using the method of order analysis [1],
the complete nonlinear model is reduced for each term [16]. Small swing angles
are assumed [17], the un-actuated states' equations are substituted into the
actuated states' equations, and the actuated states' equations are substituted
into the un-actuated states' equations, resulting in the following equations:


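$$\ddot{x}_t + \left(\frac{B_x}{m_t} + \frac{K_2}{m_t}\right)\dot{x}_t - \frac{m_p}{m_t}\, g\, \alpha = \frac{1}{m_t} K_1 u_x, \tag{2}$$

$$\left(1 + \frac{m_t}{J_\theta} x_t^2\right)\ddot{\theta} + \frac{B_\theta + K_2}{J_\theta}\,\dot{\theta} - \frac{m_p}{J_\theta}\, g\, x_t\, \beta = \frac{1}{J_\theta} K_1 u_\theta, \tag{3}$$

$$\ddot{l} + \left(\frac{B_l}{m_p} + \frac{K_2}{m_p}\right)\dot{l} = \frac{1}{m_p} K_1 u_l, \tag{4}$$

$$(J_p + m_p l^2)\,\ddot{\alpha} + B_\alpha \dot{\alpha} + g m_p l\, \alpha + m_p l\, \ddot{x}_t - m_p x_t l\, \dot{\theta}^2 - 2 m_p l^2\, \dot{\theta} \dot{\beta} = 0,$$

$$m_p x_t l\, \ddot{\theta} + (J_p + m_p l^2)\,\ddot{\beta} + B_\beta \dot{\beta} + g m_p l\, \beta = 0, \tag{5}$$

where

$$F_{q_1} = K_1 u_{(q_1)} - K_2 \dot{q}_1, \tag{6}$$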
2 42 
Nai) Ke(qi) Kml; Nak e(aiykm(ai 
Vee we ae pas P (a) 
a(qi) a(qi) 


$m_t$ is the mass of the trolley, $m_p$ is the mass of the payload, $B_{(q_i)}$
is the viscous friction coefficient, $g$ is the gravitational constant, and
$J_\theta$ and $J_p$ are the moments of inertia of the jib and the load,
respectively. The motor parameters are: $K_{g(q_1)}$ is the gear ratio,
$N_{(q_1)}$ is the combined gearbox and motor efficiency, $k_{m(q_1)}$ is the
torque constant, $r_x$ is the radius of the pulley, $R_{a(q_1)}$ is the armature
resistance and $G_{a(q_1)}$ is the amplifier gain. The equations of motion of
the tower crane are of the form

$$M(q)\,\ddot{q} + B(q,\dot{q}) + G(q) = F, \tag{7}$$

where $M(q) \in \mathbb{R}^{n \times n}$ is the inertia matrix, which is positive
definite for $l > 0$ so that its inverse exists, $B(q,\dot{q}) \in \mathbb{R}^{n \times 1}$
collects the Coriolis, centripetal and friction terms, and
$G(q) \in \mathbb{R}^{n \times 1}$ denotes the gravitational force vector.


2.2 Continuous-time Nonlinear State Space Representation 


A nonlinear system can be represented in Takagi-Sugeno form on a compact set
if its state space model can be expressed as follows:

$$\dot{x} = f(x,u)\,x + g(x,u)\,u, \qquad y = h(x,u)\,x, \tag{8}$$

where $f$, $g$ and $h$ are smooth nonlinear matrix functions that are assumed to
be bounded, $x = [x_t, \dot{x}_t, \theta, \dot{\theta}, l, \dot{l}, \alpha, \dot{\alpha}, \beta, \dot{\beta}]^T$
is the state vector of (8) and $u = [u_x, u_\theta, u_l]^T$ is the input vector.
In detail, (2) - (5) result in


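$$\begin{aligned}
\dot{x}_1 &= x_2, \\
\dot{x}_2 &= A_1, \\
\dot{x}_3 &= x_4, \\
\dot{x}_4 &= A_2, \\
\dot{x}_5 &= x_6, \\
\dot{x}_6 &= \frac{1}{m_p}\left[K_1 u_l - (B_l + K_2)\,x_6\right], \\
\dot{x}_7 &= x_8, \\
\dot{x}_8 &= \frac{m_p x_5 \left(-A_1 + x_1 x_4^2 + 2 x_{10} x_4 x_5 - x_7 g\right) - B_\alpha x_8}{m_p x_5^2 + J_p}, \\
\dot{x}_9 &= x_{10}, \\
\dot{x}_{10} &= \frac{-B_\beta x_{10} - g m_p x_5 x_9 - m_p A_2 x_1 x_5}{m_p x_5^2 + J_p},
\end{aligned} \tag{9}$$

where

$$A_1 = \frac{1}{m_t}\left[K_1 u_x - x_2 (B_x + K_2) + x_7\, g\, m_p\right], \qquad
A_2 = \frac{K_1 u_\theta - x_4 (B_\theta + K_2) + g\, m_p\, x_1 x_9}{m_t x_1^2 + J_\theta}. \tag{10}$$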

3 Nonlinear Dynamic TS Fuzzy Model 


In this paper, the TS model of the tower crane is constructed analytically
from the previously presented state space model (9).

3.1 Sector Nonlinearity Approach

The Takagi-Sugeno fuzzy representation of the crane model is derived using the
sector nonlinearity approach [18] and is used in the design process of the
continuous-time observer. The scheduling variables are chosen as
$z_j \in [\underline{z}_j, \overline{z}_j]$, $j = 1, 2, \dots, p$, where
$\underline{z}_j$ and $\overline{z}_j$ are the minimum and maximum values in the
considered operating range, respectively. The six premise variables are chosen
as:


1,61, 8, ——}* (11) 


= x; mp P +J, 


m x7 + Jo 
The system's states are bounded, and the bounds follow from the physical
constraints of the real system under investigation [21]: $z_1 \in [0.22, 0.52]$,
$z_2 \in [0.386, 0.41]$, $z_3 \in [0.15, 1.2]$, $z_4 \in [-0.15, 1.2]$,
$z_5 \in [-\pi/2, \pi/2]$ and $z_6 \in [2.1, 45]$. The rules of the TS system are
constructed as:


if $z_1$ is $Z_1^i$ and ... and $z_p$ is $Z_p^i$, then

$$\dot{x} = A_i x + B_i u, \qquad y = C_i x, \tag{12}$$

where $Z_1^i, \dots, Z_p^i$ are the corresponding sets of the premise variables.
The rules are indexed by $i = 1, 2, \dots, m$ with $m = 2^p = 64$, where $p = 6$
is the number of premise variables and 2 is the number of weighting functions
per premise variable. The nonlinear system is represented as a TS fuzzy model
of the form

$$\dot{x} = \sum_{i=1}^{m} h_i(z(t))\,(A_i x + B_i u), \qquad
y = \sum_{i=1}^{m} h_i(z(t))\, C_i x. \tag{13}$$


$h_i(z(t)) \ge 0$ are the normalized membership functions with the convex sum
property $\sum_{i=1}^{m} h_i(z(t)) = 1$ [14]. Each of them is calculated as the
product of the weighting functions:

$$h_i(z(t)) = \prod_{j=1}^{p} w_{i_j}^{j}(z_j), \tag{14}$$

where $i_j \in \{0, 1\}$. For each $z_j$, two weighting functions are constructed:

$$w_1^j = 1 - w_0^j, \qquad j = 1, 2, \dots, p. \tag{15}$$
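
The construction in (14) and (15) can be illustrated with a minimal Python sketch. It assumes the linear weighting function $w_0^j(z_j) = (\overline{z}_j - z_j)/(\overline{z}_j - \underline{z}_j)$ that is commonly used with the sector nonlinearity approach; this form, the function names and the sample operating point are illustrative assumptions and are not taken from the paper. The bounds are the operating ranges quoted in the text.

```python
import itertools
import math

# Operating ranges of the six premise variables as quoted in the text.
BOUNDS = [(0.22, 0.52), (0.386, 0.41), (0.15, 1.2),
          (-0.15, 1.2), (-math.pi / 2, math.pi / 2), (2.1, 45.0)]


def weights(z, lo, hi):
    """Return (w0, w1) for one premise variable (assumed linear form)."""
    w0 = (hi - z) / (hi - lo)
    return w0, 1.0 - w0          # w1 = 1 - w0, cf. (15)


def memberships(z):
    """Normalized membership degrees h_i(z) of all 2**p rules, cf. (14)."""
    pairs = [weights(zj, lo, hi) for zj, (lo, hi) in zip(z, BOUNDS)]
    h = []
    for idx in itertools.product((0, 1), repeat=len(pairs)):
        # One rule per combination of indices i_j in {0, 1}.
        h.append(math.prod(pair[i] for pair, i in zip(pairs, idx)))
    return h


# Hypothetical operating point inside the bounds.
z_sample = [0.30, 0.40, 0.50, 0.20, 0.10, 10.0]
h = memberships(z_sample)
print(len(h), round(sum(h), 12))
```

With six premise variables the sketch enumerates all 64 rules, and the membership degrees sum to one by construction, which is the convex sum property used in the stability analysis.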


4 TS Fuzzy Observer 


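In this section, the nonlinear observer is designed to estimate the unmeasured
states based on the system model (13):

$$\dot{\hat{x}} = \sum_{i=1}^{m} h_i(z)\left[A_i \hat{x} + B_i u + L_i (y - \hat{y})\right], \qquad
\hat{y} = C \hat{x}, \tag{16}$$

where

$$C = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix} \tag{17}$$

selects the five measured states, $\hat{x}$ is the estimated state vector,
$\hat{y}$ is the estimated measurement and $L_i$ are the observer gains.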


4.1 Stability analysis via Lyapunov Approach 
The stability analysis is reduced to a linear matrix inequality (LMI) problem,
whose solution is equivalent to a solution of the original problem. The
estimation error is defined as $e = x - \hat{x}$, and the error dynamics are [19]:

$$\dot{e} = \dot{x} - \dot{\hat{x}} = \sum_{i=1}^{m} h_i(z)\,(A_i - L_i C)\, e. \tag{18}$$


188 Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020 


Theorem 4.1 [14] (page 64): The estimation error dynamics (18) with common
measurement matrix $C$ are asymptotically stable if there exist $P = P^T > 0$
and $L_i$ such that

$$\mathcal{H}\bigl(P (A_i - L_i C)\bigr) < 0 \tag{19}$$

for all $i = 1, 2, \dots, m$, where $\mathcal{H}(X) = X^T + X$. With the change of
variables $M_i = P L_i$, this condition becomes an LMI problem in $P$ and $M_i$.
A performance requirement is met by imposing a convergence rate on the
observer, such that

$$\mathcal{H}(P A_i - M_i C) + 2\alpha P < 0, \tag{20}$$

where $\alpha$ is the decay rate of the estimation error $e$.
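
The role of the decay rate in (20) can be made explicit with a short standard argument. The sketch below assumes the quadratic Lyapunov function $V(e) = e^T P e$ with $P = P^T > 0$, the substitution $M_i = P L_i$ and the convex sum property of the membership functions; $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest eigenvalue of $P$ and are notation introduced here.

```latex
\begin{aligned}
\mathcal{H}(P A_i - M_i C) &= \mathcal{H}\bigl(P (A_i - L_i C)\bigr),
  \qquad M_i = P L_i, \\
\dot{V}(e) &= \sum_{i=1}^{m} h_i(z)\, e^{T}\,
  \mathcal{H}\bigl(P (A_i - L_i C)\bigr)\, e
  \;\le\; -2\alpha\, e^{T} P e = -2\alpha\, V(e), \\
V\bigl(e(t)\bigr) &\le \exp(-2\alpha t)\, V\bigl(e(0)\bigr), \\
\lVert e(t) \rVert &\le
  \sqrt{\frac{\lambda_{\max}(P)}{\lambda_{\min}(P)}}\;
  \exp(-\alpha t)\, \lVert e(0) \rVert .
\end{aligned}
```

Under these assumptions the estimation error decays at least as fast as $\exp(-\alpha t)$, so a larger $\alpha$ enforces a faster observer at the price of a more restrictive LMI.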


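The stability condition of Theorem 4.1 is derived using the quadratic function
$V(e) = e^T P e$. The derivative of the Lyapunov function is:

$$\begin{aligned}
\dot{V}(e) &= \dot{e}^T P e + e^T P \dot{e} \\
&= \sum_{i=1}^{m} h_i(z)\left[\bigl((A_i - L_i C)\, e\bigr)^T P e + e^T P (A_i - L_i C)\, e\right] \\
&= \sum_{i=1}^{m} h_i(z)\, e^T \left((A_i - L_i C)^T P + P (A_i - L_i C)\right) e < 0.
\end{aligned} \tag{21}$$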

If a common positive definite matrix $P = P^T > 0$ exists for all $m = 64$ local
models such that the Lyapunov function is decreasing, the error dynamics are
globally asymptotically stable. The observer thus guarantees by design that the
estimation error converges asymptotically to zero. The system equations are
used to obtain $M_i$, from which the observer gains follow as $L_i = P^{-1} M_i$.
The solution is computed using the YALMIP toolbox [20] and the SeDuMi solver in
MATLAB.

Table 1: The values of the estimated parameters

Crane parameters (mechanics)      Crane parameters (drives)
$m_t = 0.7$ kg                    $K_{g(x)} = K_{g(l)} = 76.84$
$m_p = 0.32$ kg                   $K_{g(\theta)} = 275$
$B_x = 28$ Nm/s                   $N_{(x)} = N_{(l)} = 0.36$
$B_\theta = 14$ Nm/s              $N_{(\theta)} = 0.24$
$B_l = 19.5$ Nm/s                 $k_{m(x)} = k_{m(l)} = 0.032$ Nm/A
$B_\alpha = 0.001$ Nm/s           $k_{m(\theta)} = 0.0195$ Nm/A
$B_\beta = 0.001$ Nm/s            $r_x = 0.0375$ m
$J_\theta = 1.7$ kg m^2           $R_{a(x)} = R_{a(l)} = 25$ V/A
$J_p = 0.023$ kg m^2              $R_{a(\theta)} = 0.5$ V/A
$g = 9.81$ m/s^2                  $G_{a(x)} = 15$
                                  $G_{a(\theta)} = G_{a(l)} = 12$

Figure 2: Trolley motion: (a) trolley position $x_t$, (b) trolley velocity $\dot{x}_t$
Figure 3: Jib motion
Figure 4: Payload motion: $\alpha$ coordinate
Figure 5: Payload motion: $\beta$ coordinate
Figure 6: Payload motion: cable coordinate $l$
Figure 7: Estimation error $e = x - \hat{x}$


5 Experiments 


For the experimental investigations, a real-time data acquisition board
(RT-DAC/USB2) is used as the interface between the personal computer and the
laboratory-scale tower crane [21]. The crane parameters used in the model
equations were estimated in [22] using the prediction error method. Other
parameters, such as the friction coefficients and the inertias, are estimated
with an off-line parameter identification tool in MATLAB via a sum-of-squares
method based on the structure of the model. The motor parameters are taken from
[17], and the parameter values are given in Table 1. The decay rate is chosen
as $\alpha = 10$, so that the observer dynamics are faster than the dynamics of
the closed-loop system. The controller needs a fast reconstructed signal, which
gives the observer an advantage over existing methods. The initial conditions
of the experimental setup and of the model are equal to the home position,
which is 0.22 m for the trolley position and zero for the remaining states. The
initial conditions of the estimated states are
$\hat{x}_0 = [0.35, 0.0118, 0.894, 0.1991, 0.298, 0.15, 0.22, 0.46, 0.065, 0.98]^T$.
The system is excited by a pulse signal of 1 second duration, and the
corresponding experimental data are measured. These data are plotted against
the results of the TS model and the TS observer for the same input. The three
results are compared for a typical crane maneuver, which is a superposition of
three motions: trolley translation, jib rotation and cable motion.

Figures 2 to 6 show that the error between the experimental data and the
reduced-order model amounts to 0.62 cm in the trolley position $x_t$, 6 degrees
in the jib position $\theta$, 3 degrees in the $\alpha$ oscillation and 4 degrees
in the $\beta$ oscillation. The velocities are calculated with the conventional
method of differentiating the positions and filtering the results. The
controller, however, will be based on the observer's result to avoid inaccurate
feedback. The estimated velocities have the same profile as the filtered
velocities; the differences are due to the differentiation error and the time
delay in the filtered data. The results therefore show that the reduced-order
model captures the actual dynamics accurately, with only a small error.
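
The time delay of the conventional velocity feedback can be illustrated with a minimal Python sketch: a backward difference followed by a first-order low-pass filter. The ramp-shaped position signal, the sampling time and the filter time constant are hypothetical values chosen for illustration and are not data from the experiment.

```python
def filtered_velocity(positions, dt, tau):
    """Backward-difference velocity passed through a first-order low-pass filter.

    tau is the filter time constant; a larger tau smooths more but delays more.
    """
    a = dt / (tau + dt)              # discrete filter coefficient (backward Euler)
    v_filt, out = 0.0, []
    for k in range(1, len(positions)):
        v_raw = (positions[k] - positions[k - 1]) / dt
        v_filt += a * (v_raw - v_filt)
        out.append(v_filt)
    return out


dt, tau = 0.01, 0.05                 # hypothetical sampling time and time constant
# Trolley starts at 0.22 m and moves at a constant 0.1 m/s.
positions = [0.22 + 0.1 * k * dt for k in range(101)]
v = filtered_velocity(positions, dt, tau)

# Time until the filtered estimate reaches 95 % of the true velocity.
settle = next(k for k, vk in enumerate(v) if vk >= 0.095) * dt
print(f"filtered estimate needs about {settle:.2f} s to reach 95 % of 0.1 m/s")
```

Even for this noise-free ramp the filtered estimate lags the true velocity by several sampling steps, which is the kind of delay that a model-based observer is meant to avoid.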


Moreover, the overlap of the curves in Figures 2 to 6 shows that the obtained
TS model is identical to the original model within the considered limits. The
estimation error converges to zero in less than 1 second for all states, as
shown in Figure 7.

1 Introduction


As part of the "HumFlow" project¹ at the Department of Electrical Engineering
and Information Technology of Hochschule Fulda, the authors aim to solve the
problem of intrusive measurement data acquisition in heritage-protected
building fabric by using wireless sensors on the outside and inside of a room.
These sensors measure the temperature and relative humidity of the wall surface
and are mounted opposite each other on the exterior and the interior wall in a
"minimally invasive" manner, i.e. attaching the sensors causes no significant
damage to the wall. This is of particular interest for listed buildings. With
these measured quantities and a hygrothermal model of the wall, a better
prediction of the indoor climate becomes possible, for example. The predictions
can, for instance, be processed further in a model predictive control scheme,
which can ensure compliance with the indoor climate requirements of preventive
conservation [1]. Further fields of application are conceivable.


¹ Funded by the "Forschung für die Praxis" funding programme of the HMWK

Figure 1: Left: cross-section of the "HumFlow" test rig. The housing is shown in light grey.
The sensors are shown as dark grey filled circles. Right: arrangement of the combined
sensors (dark grey) at different depths in the wall element (grey).


This contribution deals with the data-driven modelling of the heat and moisture
transfer mechanisms in a wall segment. It continues the fundamental work of [2]
and extends it by nonlinear modelling approaches based on artificial neural
networks (ANN) and Takagi-Sugeno fuzzy systems (TS) [3]. The experimental setup
is described in detail in [2] but is briefly outlined again in Section 2. After
a short description of the methodology in Section 3, the results are assessed
in Section 4.


2 The Test Rig


Figure 1 outlines the experimental setup. It consists of a housing, open on one
side, made of phenolic-resin-coated plywood. A removable wall segment can be
inserted into the housing, dividing its interior into two regions. The open
side of the housing is called "Region 2" and exchanges air directly with the
laboratory environment. "Region 1", which is separated by the wall segment,
contains an electrically controllable heating element and a compact
constant-humidity device. These allow different temperature and humidity
conditions to be generated.

Several combined sensors measure the temperature and the relative humidity in
Regions 1 and 2; the indices 1 and 2 indicate the measurement location. The
surface temperatures and humidities of the wall are likewise recorded on both
sides, each with one sensor pair (index surf). Inside the wall, ten further
combined sensors, arranged in two rows, are set into boreholes of different
depths to record the same quantities (see Figure 1, right). In total, twenty
quantities are thus measured within the wall segment. The upper and the lower
sensor row are arranged at the same depths.


3 System Identification


The hygrothermal behaviour of the wall segment (clay) is represented by a
linear state-space model (SS), a TS model and an ANN. The latter two approaches
are able to describe nonlinear dynamic processes [4]. The SS model serves as a
reference for assessing the quality of the TS and ANN models. The System
Identification Toolbox and the Deep Learning Toolbox integrated in MATLAB are
used; the latter also contains all methods needed to train shallow networks.
The TS model is estimated with the LOLIMOT algorithm from the LMN-Toolbox,
version 1.5.2 [5]. The model structures of the nonlinear approaches can be
interpreted as nonlinear autoregressive models with exogenous inputs (NARX)
[4]; see equations (1) and (2).


$$\hat{y}_{\mathrm{ARX}}(k) = \sum_{i=1}^{n_b} b_i\, u(k-i) - \sum_{i=1}^{n_a} a_i\, y(k-i) = \Theta\, x, \tag{1}$$

$$\Theta = (b_1, \dots, b_{n_b}, -a_1, \dots, -a_{n_a}), \qquad
x = \bigl(u(k-1), \dots, u(k-n_b),\; y(k-1), \dots, y(k-n_a)\bigr)^T$$

The equation above shows the simple case of a linear single-input
single-output system. The extension to multivariable systems is straightforward
and is therefore not shown here. With a nonlinear function $f(\cdot)$, a NARX
model results:

$$\hat{y}_{\mathrm{NARX}}(k) = f(\Theta\, x). \tag{2}$$


The regressor vector $x \in \mathbb{R}^{(36 \times 1)}$ is composed of measured
quantities from the last and the second-to-last sampling step. All eight
measured quantities (temperatures and relative humidities) outside the wall
segment are treated as inputs. The ten measured quantities inside the wall
(five temperature and five humidity values) form the outputs of the system.
Hence $x$ takes the form

$$x^T = (x_1^T \;\; x_2^T \;\; 1), \tag{3}$$

where $x_\gamma$ with $\gamma = 1, 2$ can be written as

$$\begin{aligned}
x_\gamma^T = \bigl(&\vartheta_1(k-\gamma),\, \varphi_1(k-\gamma),\, \vartheta_2(k-\gamma),\, \varphi_2(k-\gamma), \\
&\vartheta_{1,\mathrm{surf}}(k-\gamma),\, \varphi_{1,\mathrm{surf}}(k-\gamma),\, \vartheta_{2,\mathrm{surf}}(k-\gamma),\, \varphi_{2,\mathrm{surf}}(k-\gamma), \\
&\vartheta_{w,1}(k-\gamma), \dots, \vartheta_{w,5}(k-\gamma),\, \varphi_{w,1}(k-\gamma), \dots, \varphi_{w,5}(k-\gamma)\bigr).
\end{aligned} \tag{4}$$

The "1" regressor in $x$ is useful for estimating an offset parameter. This
yields a model structure for the linear SS model with eight inputs and $p = 10$
outputs. States and outputs are equivalent. All elements of the system matrix
and of the input matrix are defined as free parameters to be estimated.

The model structures of the TS and ANN models, in contrast, are more complex.
The TS model combines $M = 10$ local linear ARX models with the parameter
matrices $\Theta_{\mathrm{TS},i} \in \mathbb{R}^{(p \times 36)}$, $i = 1, \dots, M$,
by means of the fuzzy basis functions $\Phi_i(z)$ into a weighted sum for all
outputs $\hat{y}_{\mathrm{TS},w}(k) \in \mathbb{R}^{(p \times 1)}$:

$$\hat{y}_{\mathrm{TS},w}(k) = \sum_{i=1}^{M} \Phi_i(z)\, \Theta_{\mathrm{TS},i}\, x. \tag{5}$$

The fuzzy basis functions $\Phi_i(z)$ take values between zero and one. They are
computed by normalizing the membership degrees:

$$\Phi_i(z) = \frac{\mu_i(z)}{\sum_{j=1}^{M} \mu_j(z)}, \qquad \sum_{i=1}^{M} \Phi_i(z) = 1. \tag{6}$$

All membership functions (MF) $\mu_i(z)$ are chosen as Gaussians. With the centre
$v$ and the standard deviation $\sigma$, a Gaussian MF reads

$$\mu_{\mathrm{Gauss}}(x) = \exp\left(-\frac{(x - v)^2}{2\sigma^2}\right). \tag{7}$$
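
Equations (6) and (7) can be illustrated with a minimal Python sketch for a scalar scheduling variable. It assumes the usual Gaussian form with centre and standard deviation; the centres, the standard deviations and the evaluation point are hypothetical values, whereas in the contribution they result from the LOLIMOT partitioning.

```python
import math


def gauss_mf(z, v, sigma):
    """Gaussian membership function with centre v and standard deviation sigma."""
    return math.exp(-0.5 * ((z - v) / sigma) ** 2)


def basis_functions(z, centres, sigmas):
    """Normalize the membership degrees so that the basis functions sum to one."""
    mu = [gauss_mf(z, v, s) for v, s in zip(centres, sigmas)]
    total = sum(mu)
    return [m / total for m in mu]


# Three hypothetical local models along one scheduling variable.
centres = [20.0, 50.0, 80.0]
sigmas = [10.0, 10.0, 10.0]
phi = basis_functions(35.0, centres, sigmas)
print([round(p, 3) for p in phi])
```

At a point halfway between two centres the two neighbouring local models receive equal weight, while the distant third model contributes almost nothing. This is what makes the local models individually interpretable.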


The scheduling vector (premise vector) $z$ serves as the rule base according to
which the individual local linear models are activated. $z$ contains all
regressors of the last sampling step ($x_1$, cf. equation (4)). The parameters
of the TS model in equation (5) are estimated by a linear, local least-squares
estimation.

For the ANN, $N_{\mathrm{in}} = p = 10$ neurons are used for the hidden layer and
for the output layer, respectively. The hidden-layer neurons are each activated
by an $f_{\mathrm{tansig}}(\cdot)$ activation function, whereas the output-layer
neurons each represent a linear mapping. In addition, an offset vector
$B_{\mathrm{out}} \in \mathbb{R}^{(p \times 1)}$ is added to the model outputs:

$$\hat{y}_{\mathrm{ANN},w}(k) = W_{\mathrm{out}}\, f_{\mathrm{tansig}}(\Theta_{\mathrm{ANN}}\, x) + B_{\mathrm{out}}, \tag{8}$$

where the parameter matrix $\Theta_{\mathrm{ANN}} \in \mathbb{R}^{(N_{\mathrm{in}} \times 36)}$
weights the regressors and $W_{\mathrm{out}} \in \mathbb{R}^{(p \times N_{\mathrm{in}})}$
weights the outputs of the hidden layer once more. Training the ANN model
requires a nonlinear parameter optimizer; here the Levenberg-Marquardt
algorithm is used.
All models are trained on a training data set of just under 30 weeks and
evaluated on a test data set of about one month. Both data sets are sampled
every ten minutes. After successful training, the models are simulated on the
test data set and their performance is evaluated. Simulation here means that
the model outputs themselves are fed back instead of passing the real process
outputs (i.e. the measured temperatures and relative humidities in the wall) as
regressors at each sampling step. Assessing the performance of a model that is
simulated on unseen data is informative with respect to its generalization
capability. If a model is to serve as the basis for a multi-step prediction,
simulating it can also be advantageous, provided the prediction horizon is
sufficiently large.

Feeding back the model outputs means that the structure is no longer an ARX
model structure but an output-error (OE) structure. Because OE models are
considerably harder to train and require more time for parameter estimation,
system identification usually relies on ARX model structures for training and
accepts inconsistent parameter estimates. Nevertheless, good results are often
achieved with this approach [4].
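
The difference between one-step prediction (ARX) and simulation with output feedback (OE) can be sketched in Python for a first-order single-input single-output model. The coefficients and the data below are hypothetical and only illustrate the two modes of evaluation; they are unrelated to the identified wall models.

```python
def one_step_prediction(a1, b1, u, y_meas):
    """ARX mode: every prediction uses the measured output of the previous step."""
    return [b1 * u[k - 1] - a1 * y_meas[k - 1] for k in range(1, len(u))]


def simulation(a1, b1, u, y0):
    """Simulation mode: the model output itself is fed back as regressor."""
    y_hat = [y0]
    for k in range(1, len(u)):
        y_hat.append(b1 * u[k - 1] - a1 * y_hat[k - 1])
    return y_hat[1:]


# Hypothetical "true" process and a slightly mismatched model.
a_true, b_true = -0.90, 0.10
a_model, b_model = -0.88, 0.10
u = [1.0] * 50
y = [0.0]
for k in range(1, len(u)):
    y.append(b_true * u[k - 1] - a_true * y[k - 1])

pred = one_step_prediction(a_model, b_model, u, y)
sim = simulation(a_model, b_model, u, y[0])

err_pred = max(abs(p - m) for p, m in zip(pred, y[1:]))
err_sim = max(abs(s - m) for s, m in zip(sim, y[1:]))
print(f"max one-step error {err_pred:.4f}, max simulation error {err_sim:.4f}")
```

In simulation mode the small parameter mismatch accumulates through the feedback, so the simulation error is considerably larger than the one-step error. This is why simulation on unseen data is the more demanding test of a model.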


4 Results


A graphical comparison of all process outputs with the corresponding model
outputs would exceed the scope of this short contribution. In the following,
only the relative humidity measurements of the third sensor
($\vartheta_{w,3}$, $\varphi_{w,3}$) in the wall segment are therefore compared
with the simulated model outputs. The model qualities are compared on the test
data only.


Figure 2 shows the time series of the relative humidity $\varphi_{w,3}$ together
with the model outputs. As the quality and error measures in Table 1 also show,
the ANN performs best, followed by the TS fuzzy system. The linear state-space
model yields only a poor model quality because it cannot represent the
nonlinearities of the process. It is clearly evident, however, that the SS
model can represent the temperature curves sufficiently well. Estimating the
relative humidities in the wall is considerably more difficult. The underlying
nonlinear moisture storage and moisture transport effects (sorption isotherm,
diffusion processes, etc.) weigh heavily here.

Figure 2: Measured relative humidity $\varphi_{w,3}$ inside the wall compared with the
simulated model outputs.


5 Conclusion and Outlook


The better performance of the ANN can be attributed to repeated retraining,
which was necessary because the quality of the ANN was not satisfactory. This
"re-training", however, required both a high computational effort and a long
computation time. The TS model synthesized with the LOLIMOT algorithm required
only a single training run and yields satisfactory quality and error measures
(see Table 1). In the authors' view, the TS model is also preferable because of
its local interpretability. Owing to its lack of flexibility and insufficient
accuracy, the linear state-space model is not adequate for modelling the heat
and moisture transport through a wall segment.


This contribution gives an insight into the ongoing work on the "HumFlow"
project and shows that even very simply synthesized nonlinear models allow
temperature and moisture distributions in wall segments to be estimated with
satisfactory accuracy. From a building physics perspective, an inexpensive and
permanent estimation of the heat and moisture input into a wall would thus be
conceivable without damaging the building fabric. This application is of
particular interest for listed historic buildings. To the authors' current
knowledge, no comparable measurement method is available on the market. From a
control engineering perspective, the presented approach can provide important
predicted values for room air temperature and room air humidity control and
thus extend existing indoor climate systems. In future work, the TS fuzzy
approach will be pursued further, but the premise space will be partitioned
with the help of clustering algorithms [7].

Table 1: Normalized mean squared error (NMSE) and best fit rate (BFR) according to [6] of the
models for each measured quantity. The error and quality measures for $\varphi_{w,3}$ from
Figure 2 are highlighted (marked with *).

                        NMSE                        BFR in %
Quantity            SS      TS      ANN         SS      TS      ANN
$\varphi_{w,1}$     0.3305  0.2047  0.0373      42.51   54.75   80.67
$\vartheta_{w,1}$   0.0318  0.0105  0.0020      82.18   89.77   95.56
$\varphi_{w,2}$     0.6350  0.1919  0.0569      20.31   56.19   76.14
$\vartheta_{w,2}$   0.0567  0.0076  0.0013      76.19   91.28   96.33
$\varphi_{w,3}$ *   0.9620  0.1835  0.1454       1.92   57.16   61.87
$\vartheta_{w,3}$   0.0155  0.0092  0.0020      87.54   90.40   95.51
$\varphi_{w,4}$     0.5557  0.1435  0.1908      25.46   62.12   56.32
$\vartheta_{w,4}$   0.2853  0.0130  0.0069      46.59   88.61   91.70
$\varphi_{w,5}$     0.1291  0.0576  0.0791      64.07   76.01   71.87
$\vartheta_{w,5}$   0.0596  0.0203  0.0104      75.59   85.74   89.78
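
The quality measures of Table 1 can be sketched in Python as follows. The definitions below are the common ones (NMSE as the squared error normalized by the variation of the measurement around its mean, BFR as 100 · (1 − √NMSE)). They are an assumption here, since the exact definitions are given in [6], although the tabulated pairs are consistent with this relation (for example NMSE 0.3305 and BFR 42.51 %). The data are hypothetical.

```python
import math


def nmse(y, y_hat):
    """Squared error normalized by the variation of y around its mean."""
    y_mean = sum(y) / len(y)
    num = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    den = sum((a - y_mean) ** 2 for a in y)
    return num / den


def bfr(y, y_hat):
    """Best fit rate in percent (assumed definition: 100 * (1 - sqrt(NMSE)))."""
    return 100.0 * (1.0 - math.sqrt(nmse(y, y_hat)))


# Hypothetical measured and simulated relative humidities in percent.
y = [50.0, 52.0, 55.0, 53.0, 51.0]
y_hat = [50.5, 51.5, 54.0, 53.5, 51.5]
print(round(nmse(y, y_hat), 4), round(bfr(y, y_hat), 2))
```

Because the BFR depends on the square root of the NMSE, an NMSE close to one (as for the SS model on $\varphi_{w,3}$) corresponds to a fit that is hardly better than predicting the mean of the measurement.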
1 Introduction


As part of Smart Factories, industrial processes must be optimized in terms of 
efficiency, flexibility and process reliability. This is primarily achieved by Ad- 
vanced Analytics, where data-driven models are used to analyze, describe and 
predict process behavior [1]. In this way, new process knowledge is gained and 
used, for instance, to adjust the operation mode of the process or reduce defects 
and quality problems [2]. These models need to fulfill certain requirements 
to be applicable in an industrial environment. In order to ensure a reliable 
operation of the plant and to enable optimization, they must be highly accurate. 
Furthermore, they must be interpretable in order to provide process knowledge,
to gain the operators' confidence and to resolve model uncertainties more
easily.


Decision Trees are a model class that can fulfill these requirements [3]. They
are constructed algorithmically and represented as a top-down directed acyclic
graph consisting of decision nodes and terminal leaves. This graph is a tree
that starts with a single decision node, the root, and ends in multiple
terminal leaves. To predict an output variable $\hat{y}$ (e.g. process
behavior), an unlabeled sample $x = [x_1 \dots x_M]$, which consists of $M$ input
variables $x_m$ with $m \in \mathbb{N}$, $1 \le m \le M$, must pass through the
tree until a terminal leaf is reached. Each decision node contains a test
function that is applied to $x$ and determines the path that $x$ takes through
the tree. The test functions are usually formulated as univariate splitting
criteria $x_m < c$ with $c \in \mathbb{R}$ or $x_m \in Z$ with $Z \subset \mathcal{A}$,
i.e. with a numerical threshold value $c$ or a subset $Z$ of a set of
categorical attributes $\mathcal{A}$. Each terminal leaf contains a local model
for prediction that is valid only in the partition of the input space defined
by the tree's structure. Because of their rule-based structure, trees are human
readable and easy to interpret. They can predict both numerical (Regression
Tree) and categorical (Classification Tree) output variables. In addition,
numerical and categorical input variables as well as missing input values can
be handled, and the importance of input variables can be measured [3, 4, 5].
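
How a sample passes through such a tree can be sketched in Python. The node layout, the variable indices, the thresholds and the constant leaf values are hypothetical and serve only to illustrate the traversal with univariate splitting criteria.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Decision node (feature and threshold set) or terminal leaf (value set)."""
    feature: Optional[int] = None      # index m of the tested input variable
    threshold: Optional[float] = None  # numerical threshold c
    left: Optional["Node"] = None      # branch taken if x[m] < c
    right: Optional["Node"] = None     # branch taken otherwise
    value: Optional[float] = None      # local (here: constant) model of a leaf


def predict(node: Node, x: Sequence[float]) -> float:
    """Pass the sample x from the root down to a terminal leaf."""
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value


# Hypothetical tree with two decision nodes and three terminal leaves.
tree = Node(feature=0, threshold=2.0,
            left=Node(value=1.0),
            right=Node(feature=1, threshold=0.5,
                       left=Node(value=3.0),
                       right=Node(value=5.0)))

print(predict(tree, [1.0, 0.9]), predict(tree, [4.0, 0.2]), predict(tree, [4.0, 0.7]))
```

Each root-to-leaf path corresponds to one readable rule, for example "if the first input is at least 2.0 and the second input is below 0.5, predict 3.0", which is the source of the interpretability discussed above.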


However, especially for Regression Trees with univariate splitting criteria,
there are limitations that can result in lower model accuracy and
interpretability [4, 6]. Univariate splitting criteria depend on a single input
variable, resulting in axis-orthogonal splits that limit model flexibility.
Depending on the process function, this leads to lower model accuracy and, if
simple local models are used, to a larger tree, which reduces interpretability
[7]. To overcome these issues, multivariate splitting criteria
$\sum_{m=1}^{M} \beta_m x_m < c$ with $M$ coefficients $\beta_m$ can be used to
construct axis-oblique splits. The resulting tree is called an Oblique
Regression Tree or, if both uni- and multivariate splitting criteria are used,
a Mixed Regression Tree [8]. The direction of an axis-oblique split has to
adapt to the curvature of the function and is given by its coefficients
$\beta_m$ [8, 9, 10]. Furthermore, an efficient and generalized approach is
necessary to maintain interpretability, to avoid overfitting and to overcome
the curse of dimensionality.


In this paper, a novel algorithm to construct Mixed and Oblique Regression 
Trees is presented. To determine an axis-oblique split direction adapted to the 
curvature in a partition, a first-order Least Squares Regression (LSR) model is 
used. This model is limited to significant input variables to describe this curva- 
ture, which maintains interpretability and generalization. The input variables 
are selected by analyzing the residuals of the resulting splitting model, which 
additionally weakens the curse of dimensionality. To construct the local models 
for prediction, stepwise regression is used. In Section 2, common algorithms 
for the construction of Regression Trees are explained. The proposed algorithm 
is presented in Section 3 and tested in Section 4 in an extensive experimental 
analysis with both synthetic and real-world data. Moreover, the results are 
compared with a state-of-the-art construction algorithm. Finally, Section 5
summarizes the paper and gives an outlook on further research.
2 Construction Algorithms for Regression Trees


This section explains how algorithms for the construction of Regression Trees
work. The description goes into more detail for the common algorithms SUPPORT
[11], CART [12], GUIDE [6] and PHDRT [10], which generate the eponymous trees.


Regression Trees are constructed by a divide-and-conquer strategy, which
recursively splits a set of $N$ labeled samples $\mathcal{D} = \{X, y\}$ with
$X = [x_1 \dots x_N]^T$ and the corresponding labels $y = [y_1 \dots y_N]^T$ into
smaller subsets $\mathcal{D}_k$ until a stopping criterion is reached. Each set
of labeled samples $\mathcal{D}_k$ is represented as a node $t_k$ with
$k \in \{1, 2, \dots, K\}$ within the tree $T$, which consists of $|T| = K$
nodes [5].


The recursive splitting process to construct a tree is shown in Figure 1 and
starts with the entire data set $\mathcal{D}_1 = \mathcal{D}$, represented by the
root $t_1$. First, in step a), a stopping rule for the node $t_k$ is checked.
The stopping rule ensures that only meaningful splits of $\mathcal{D}_k$ are
performed and that the size of the tree is limited. A common stopping rule is a
lower bound on the number of samples in a node, which is used in all four
algorithms [5].


In step b) of Figure 1, the node $t_k$ becomes a terminal leaf $\tilde{t}_k$
when splitting is stopped. Each terminal leaf represents a certain partition of
the input space and contains a local model $\hat{y}_k(x) \in \mathbb{R}$ that
approximates the function within that partition. The local models of GUIDE and
PHDRT are first-order multiple regression models, and those of SUPPORT are
third-order polynomial regression models [11, 6, 10]. Furthermore, SUPPORT
combines all local models by a weighted average to create a continuous model
output. The local models of SUPPORT and PHDRT use all input variables. In
contrast, GUIDE limits the local models to significant input variables using
stepwise regression. The local models of CART are constant values, which are
determined by the mean value of $y$ in that partition [12].
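
Steps a) and b) can be sketched in Python together with a simple univariate split search. The sketch uses constant leaf models (the mean of y, as in CART) and an exhaustive search over thresholds that minimizes the summed squared error. The minimum node size, the dictionary-based tree layout and the data are hypothetical choices, and none of the four algorithms is reproduced exactly.

```python
def sse(y):
    """Sum of squared deviations from the mean (impurity of a node)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def best_split(X, y):
    """Exhaustive search for the univariate split x_m < c with the lowest SSE."""
    best = None
    for m in range(len(X[0])):
        for c in sorted({row[m] for row in X})[1:]:
            left = [v for row, v in zip(X, y) if row[m] < c]
            right = [v for row, v in zip(X, y) if row[m] >= c]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, m, c)
    return best


def build(X, y, min_samples=4):
    """Recursive divide-and-conquer construction of an unpruned tree."""
    split = best_split(X, y) if len(y) >= min_samples else None
    if split is None:                      # stopping rule met: terminal leaf
        return {"value": sum(y) / len(y)}
    _, m, c = split
    left = [(row, v) for row, v in zip(X, y) if row[m] < c]
    right = [(row, v) for row, v in zip(X, y) if row[m] >= c]
    return {"feature": m, "threshold": c,
            "left": build([r for r, _ in left], [v for _, v in left], min_samples),
            "right": build([r for r, _ in right], [v for _, v in right], min_samples)}


# Hypothetical data: a step function of the first input variable.
X = [[0.1, 5.0], [0.2, 3.0], [0.3, 4.0], [0.7, 5.0], [0.8, 3.0], [0.9, 4.0]]
y = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
tree = build(X, y)
print(tree["feature"], tree["threshold"])
```

The lower bound on the number of samples acts as the stopping rule: both child nodes of the root hold only three samples and therefore become terminal leaves with constant local models.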


Figure 1: Recursive splitting process of a construction algorithm to create an unpruned tree T.


The node $t_k$ is split further if the stopping rule is not fulfilled. For this
purpose, in step c) of Figure 1 the input variable(s) $x_m$ and the threshold value
$c$ or subset $Z$ used to construct a uni- or multivariate splitting criterion are selected.
The components $x_m$ and $c$ or $Z$ are selected such that the impurity is
reduced as much as possible by the resulting splitting criterion [5, 13]. In
Regression Trees the impurity is described e.g. by the degree of non-linearity 
in a partition, which is measured by a specific quality criterion. To measure the 
non-linearity in a partition, the error of a linear approximation can be used. To 
split a node, CART uses both uni- and multivariate criteria, which are either 
selected by a brute force method (univariate) or by a heuristic-based selection 
method (multivariate) called Linear Combination Search Algorithm [12]. In 
contrast, SUPPORT is limited to univariate splitting criteria and numerical 
input variables. For split selection, the samples within a node are approximated 
by multiple linear regression and divided into two sample groups with positive 
and negative signs of residuals. By means of statistical tests for differences in mean
and variance between these two sample groups, dependencies between input
variables and residuals are analyzed to select the most significant $x_m$. The
threshold value $c$ is determined by averaging the means of the two groups
of $x_m$ [11]. The splitting method of GUIDE is similar to that of SUPPORT and
differs in the use of $\chi^2$ independence tests, an interaction test between input
variables and the ability to handle categorical input variables. Furthermore,
due to a bootstrap-based bias correction, the significance of input variables is
more comparable. Threshold values $c$ are determined either by the median or
mean of $x_m$, and subsets $Z$ for a categorical input variable by a heuristic [6].
The splitting of PHDRT is limited to numerical input variables and multivariate 
criteria, which are determined by the first component of principal Hessian Di- 
rections (PHD). This component describes the direction in which the function 
to be approximated has the greatest curvature. To select a threshold value, the 
residuals of a multiple linear regression model are split into two partitions and 
approximated by one linear regression model each, using the first component 
of PHD as an input variable. The balance of the partitions is adjusted so that 
both linear models approximate the residuals with a similar standard deviation. 
Finally, the point of intersection is taken as the threshold value [10]. 


If a suitable split is determined, in step d) of Figure 1 the splitting criterion $s_k$ is
constructed based on the selected components. In addition, the node is split into
two nodes $t_{|T|+1}$ and $t_{|T|+2}$ containing the subsets $\mathcal{D}_{|T|+1}$ and $\mathcal{D}_{|T|+2}$. Recursive
splitting is completed when no more nodes can be split [5, 13].
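The recursive procedure of steps a) to d) can be sketched in a few lines. The following Python sketch is an illustration only and does not reproduce any of the four cited algorithms: it assumes a single numerical input, a CART-like brute-force univariate split that minimizes the sum of squared errors, constant leaf models (the mean of y) and a lower bound on the node size as the stopping rule. The names `build_tree`, `predict` and `min_samples` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (impurity of a node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(xs, ys, min_samples=4):
    """Recursively split (xs, ys); returns a nested dict describing the tree."""
    # Step a): stopping rule, here a lower bound on the number of samples.
    if len(xs) < min_samples:
        return {"leaf": True, "value": sum(ys) / len(ys)}  # step b)

    # Step c): brute-force search for the threshold c with the lowest impurity.
    best = None
    for c in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < c]
        right = [y for x, y in zip(xs, ys) if x >= c]
        impurity = sse(left) + sse(right)
        if best is None or impurity < best[0]:
            best = (impurity, c)

    # No threshold improves on the unsplit node: create a terminal leaf.
    if best is None or best[0] >= sse(ys):
        return {"leaf": True, "value": sum(ys) / len(ys)}

    # Step d): construct the criterion x < c and recurse on both subsets.
    c = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < c]
    right = [(x, y) for x, y in zip(xs, ys) if x >= c]
    return {
        "leaf": False,
        "threshold": c,
        "left": build_tree([x for x, _ in left], [y for _, y in left], min_samples),
        "right": build_tree([x for x, _ in right], [y for _, y in right], min_samples),
    }


def predict(tree, x):
    """Follow the splitting criteria down to a leaf and return its constant model."""
    while not tree["leaf"]:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["value"]


xs = [0.0, 0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
tree = build_tree(xs, ys)
```

On this step-shaped toy data the root is split once and both children become leaves, because no further threshold reduces the impurity.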


Further approaches to limit the size of the tree are pre- and postpruning techni- 
ques. Prepruning is closely related to the stopping rule and limits the size du- 
ring the construction. In contrast, postpruning is applied after the construction 
and prunes an oversized tree backwards to a more generalized one. For this 
purpose, PHDRT stops splitting if the first component of PHD is insignificant, 
which is more similar to a stopping rule than to a prepruning technique [10]. 
The complexity of SUPPORT is limited by a prepruning technique, which uses 
cross validation to check whether a subtree can be created from $t_k$ that
significantly improves model quality [11]. Both CART and GUIDE limit the size by
a postpruning technique called Minimal Cost Complexity Pruning, which uses 
cross validation to evaluate the generalization capabilities of different subtrees 
during the pruning [12, 6]. In the following, the proposed least-squares-based 
tree construction algorithm is explained in more detail.


3 Least-Squares-Based Construction Algorithm


Model quality of Regression Trees can be improved by an extension to Oblique 
or Mixed Regression Trees using multivariate splitting criteria. To obtain the 
advantages of Regression Trees, a trade-off must be found between an increase 
in model accuracy and a loss of interpretability. This is a challenging task. In 
order to achieve this, the axis-oblique split direction of a multivariate splitting
criterion must adapt to the function gradient $\nabla f(x)$ in a curvature area.
Furthermore, the complexity of this criterion must be limited in such a way that
interpretability is maintained. To determine this criterion with an appropriate 
computational effort, the curse of dimensionality must be weakened [6, 4, 8]. 


To construct the splitting criterion, the proposed algorithm uses the coefficients
of a first-order LSR model $\hat{y}_{s_k}(x)$. The input variables of $\hat{y}_{s_k}(x)$ are selected
in a way that $\hat{y}_{s_k}(x)$ adapts to the gradient $\nabla f(x)$ in the area of curvature
in that partition. This is performed by a forward selection method (FSM)
and, depending on the number of selected input variables, either a uni- or
multivariate splitting criterion is constructed. A model with a single input
variable yields a univariate splitting criterion and a model with multiple
input variables a multivariate splitting criterion. The split direction is defined
by the contour lines, contour planes or contour hyperplanes (depending on M)
of $\hat{y}_{s_k}(x)$, and the position of the split is adjusted by selecting a suitable
output value of $\hat{y}_{s_k}(x)$ as a threshold, which is explained in Subsection 3.1. The
quality of the resulting split is measured by a criterion which analyzes the
residuals of $\hat{y}_{s_k}(x)$ using a hinge function $h(\hat{y}_{s_k})$. Due to a reduction of the search
space to the one-dimensional space of residuals, the FSM overcomes the curse
of dimensionality. Furthermore, by limiting the number of significant input
variables to a maximum of $A$, interpretability and generalization are maintained.
In Subsection 3.2 the FSM is presented in more detail.


If the sample size is too small or an insufficient improvement in model quality 
is achieved, splitting is stopped and a local model for prediction is determined. 
The local models are also determined by LSR and an FSM, which is explained in
Subsection 3.3. To control the size of the tree, called Least Squares Regression Tree
(LSRT), Minimal Cost Complexity Pruning is used [12]. Finally,
Subsection 3.4 shows the structure of LSRT using a practical example. 


212 Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020 


3.1 Axis-Oblique Split Direction 


To get an axis-oblique split direction that is adapted to the gradient $\nabla f(x)$ in
the area of curvature within a partition, a direction orthogonal to $\nabla f(x)$ has to
be determined. This is achieved using a first-order LSR model


$$\hat{y}_{s_k}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m x_m . \qquad (1)$$


If the non-linearity in a local area is not excessive, a direction orthogonal to
$\nabla f(x)$ is obtained in this way. The coefficients of the model are determined by


$$\beta = [\beta_0 \; \dots \; \beta_M]^T = (X^T X)^{-1} X^T y \qquad (2)$$


using the expanded $N \times (1 + M)$ predictor matrix


$$X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_N^T \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1M} \\ 1 & x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix} \qquad (3)$$


that consists of N samples $x_n$ and an additional column of ones to determine
the constant part $\beta_0$ of the LSR model [9].
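As an illustration of Eqs. (1) to (3), the following Python sketch fits a first-order LSR model by solving the normal equations $(X^T X) \beta = X^T y$. It is a minimal sketch using only the standard library, not the authors' implementation; the helper names `solve`, `fit_lsr` and `predict_lsr` and the toy data are assumptions made for this example. In practice a QR- or SVD-based least-squares solver is usually preferable for numerical stability.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_lsr(samples, y):
    """Return beta = [beta_0, ..., beta_M] of a first-order model with intercept."""
    X = [[1.0] + list(row) for row in samples]   # predictor matrix with a column of ones
    cols = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(cols)] for i in range(cols)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(cols)]
    return solve(xtx, xty)                       # normal equations instead of an explicit inverse


def predict_lsr(beta, x):
    """Evaluate beta_0 + sum_m beta_m * x_m."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


# Toy data generated from an exactly linear function (hypothetical example).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.25)]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in samples]
beta = fit_lsr(samples, y)
```

Because the toy labels are exactly linear in the inputs, the fitted coefficients recover the generating values up to rounding.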


Figure 2a shows a first-order LSR model $\hat{y}_{s_k}(x)$ that was trained on 20
samples generated by a test function $f(x) = x_1 x_2$. A contour line for the
constant model output $\hat{y}_{s_k}(x) = \alpha$ is formed by the various input combinations
which result in $\alpha$ and runs as an axis-oblique border through the input space.
The direction of the contour line results from the coefficients $[\beta_1 \dots \beta_M]^T$
and is orthogonal to $\nabla \hat{y}_{s_k}(x)$, which is presented in Figure 2b. This figure
shows three possible contour lines resulting from $\alpha \in \{0, 1.1, 2.1\}$ and $\hat{y}_{s_k}(x)$
in Figure 2a. The contour lines split the input space into two partitions
and are shifted in parallel by varying $\alpha$. This allows a suitable
threshold value for splitting to be determined. Finally, to construct the multivariate
splitting criterion


$$\sum_{m=1}^{M} \beta_m x_m < \alpha - \beta_0 , \qquad (4)$$


the constant part $\beta_0$ of the LSR model is subtracted. The threshold value
$c = \alpha - \beta_0$ as well as the input variables to construct the LSR model are
determined by an FSM, which is explained in the following subsection.


Figure 2: Construction of axis-oblique splits using contour lines of an LSR model.
(a) Model output $\hat{y}_{s_k}(x)$ of a first-order LSR model (grid), trained on 20 samples
generated from $f(x) = x_1 x_2$. (b) Three oblique splits that result from the contour
lines of the model and are orthogonal to $\nabla \hat{y}_{s_k}(x)$.
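To make criterion (4) concrete, read here as $\sum_m \beta_m x_m < \alpha - \beta_0$, the following Python sketch assigns samples to the two partitions of an oblique split. The coefficients and samples are hypothetical values chosen for illustration, and the function names `goes_left` and `partition` are assumptions, not part of the described algorithm.

```python
def goes_left(beta, alpha, x):
    """Evaluate the criterion sum_m beta_m * x_m < alpha - beta_0."""
    c = alpha - beta[0]                       # threshold value c
    return sum(b * v for b, v in zip(beta[1:], x)) < c


def partition(beta, alpha, samples):
    """Split the samples into the two partitions defined by the contour line."""
    left = [x for x in samples if goes_left(beta, alpha, x)]
    right = [x for x in samples if not goes_left(beta, alpha, x)]
    return left, right


beta = [0.5, 1.0, 1.0]                        # hypothetical [beta_0, beta_1, beta_2]
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

left_a, right_a = partition(beta, alpha=1.1, samples=samples)
left_b, right_b = partition(beta, alpha=2.1, samples=samples)
```

Changing `alpha` leaves the split direction untouched and only shifts the border in parallel, so more samples fall into the left partition for the larger value.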


3.2 Split Selection 


In order to construct a uni- or multivariate splitting criterion with regard to the
requirements of approximation capability, interpretability and generalization, 
suitable input variables for the LSR model and a suitable threshold value c 
must be selected. This is achieved by the FSM presented in Figure 3. 


First, the locally optimal quality value $\gamma^* \in \mathbb{R}$, a maximum number of input
variables $A \in \mathbb{N} \mid 1 < A < M$ to limit the complexity of the splitting criterion
and the index $m$ are initialized. Furthermore, the selected input variables
$x^* \in \mathbb{R}^1$ to construct the final splitting criterion are initialized with $x^* = [1]$
for the constant part $\beta_0$. Each forward iteration performs a greedy search over
all unselected input variables to extend an existing splitting criterion (resulting
from $x^*$) by a locally optimal candidate input variable $x_m$. To identify the locally
optimal candidate during the greedy search, the quality of the splitting criterion
resulting from an extension with $x_m$ is measured using a specific quality
criterion.


Figure 3: Activity diagram of the FSM to construct a suitable uni- or multivariate splitting criterion.
Recoverable step labels: a) estimate LSR model with $[x^* \; x_m]$ and compute residuals;
b) fit hinge function to $e_{lsr}$, compute new residuals and ratio $\gamma$, yielding $h(\hat{y}_{s_k})$, $e_h$
and $\gamma = |e_h| / |e_{lsr}|$; c) create new suitable splitting criterion with $c^* =$ hinge,
$\gamma^* = \gamma$, $\beta^* = \beta$, $x_{temp} = [x^* \; x_m]$.


To measure the quality which results from the extension, in step a) an LSR model
$\hat{y}_{s_k}(x^*, x_m)$ for splitting is determined. This model is constructed based on the
previously selected input variables $x^*$ and the candidate $x_m$. The residuals [9]


$$e_{lsr} = [e_1 \; \dots \; e_N]^T = y - \hat{y} = [y_1 - \hat{y}_{s_k,1} \; \dots \; y_N - \hat{y}_{s_k,N}]^T \qquad (5)$$


of the model output $\hat{y}_{s_k,n} = \hat{y}_{s_k}(x_n^*, x_{n,m})$ with $n \in \{1, \dots, N\}$ are computed and
in step b) approximated by a hinge function


$$h(\hat{y}_{s_k}) = \min([1 \; \hat{y}_{s_k}] \beta_1, [1 \; \hat{y}_{s_k}] \beta_2) \quad \text{or} \quad h(\hat{y}_{s_k}) = \max([1 \; \hat{y}_{s_k}] \beta_1, [1 \; \hat{y}_{s_k}] \beta_2) . \qquad (6)$$


The hinge function consists of two local linear LSR models $\beta_i = [\beta_{i,0} \; \beta_{i,1}]^T \; \forall \; i \in
\{1, 2\}$, which are joined together by a hinge point [9, 14].


Figure 4: Example of the process to measure the quality of a splitting criterion. The residuals
$e_{lsr}$ of an LSR model $\hat{y}_{s_k}(x^*)$ are approximated by a hinge function $h(\hat{y}_{s_k})$ (solid line),
which consists of two LSR models trained on different subsets $\mathcal{S}_k$. The quality is measured
by the improvement in approximation capability achieved by $h(\hat{y}_{s_k})$.


Figure 4 shows a hinge function (solid line) that was determined from N = 24
samples $\{\hat{y}_{s_k}, e_{lsr}\}$. To construct $h(\hat{y}_{s_k})$, the samples are ordered and
segmented into K subsets $\mathcal{S}_k$ containing an equal number of samples. This
segmentation helps to overcome the challenging effects of skewed data. The
subsets resulting from the segmentation are grouped into $\mathcal{G}_{left}$ and $\mathcal{G}_{right}$ to
determine $\beta_1$ with $\mathcal{G}_{left}$ and $\beta_2$ with $\mathcal{G}_{right}$. This is done iteratively by changing the
proportion of the groups until a stopping criterion is reached. In Figure 4, K = 8
subsets are used. First, the models are determined by two balanced groups
$\mathcal{G}_{left} \supset \{\mathcal{S}_1, \mathcal{S}_2, \mathcal{S}_3, \mathcal{S}_4\}$ and $\mathcal{G}_{right} \supset \{\mathcal{S}_5, \mathcal{S}_6, \mathcal{S}_7, \mathcal{S}_8\}$, which are illustrated
in Figure 4 by the gray and white areas. If the resulting hinge point lies outside a
predefined area, e.g. outside the subsets $\{\mathcal{S}_3, \dots, \mathcal{S}_{K-2}\}$, the balance of the
groups is adjusted and two new LSR models are determined by the adjusted
groups. Otherwise, the stopping criterion is fulfilled and the quality of the
splitting criterion, resulting from the candidate $x_m$ and the hinge point as a
threshold value [10], is measured.


The quality is measured by a specific criterion, which results in the quality
value


$$\gamma = \frac{|e_h|}{|e_{lsr}|} \qquad (7)$$


with the residuals $e_h = [e_1 - h(\hat{y}_{s_k,1}) \; \dots \; e_N - h(\hat{y}_{s_k,N})]^T$ of $h(\hat{y}_{s_k})$. The quality
criterion describes how much the non-linearity in a partition can be reduced
along a certain direction by the splitting criterion. A decrease of non-linearity
is indicated by $\gamma < 1$, and to fulfill the quality criterion within a forward
iteration, the candidate $x_m$ has to be selected such that $\gamma$ is minimized to
$\gamma^*$. Through this minimization, the direction of non-linearity in a partition
that can most likely be approximated by two local linear models is identified.
Furthermore, this minimization has the effect that


• the orientation of $\hat{y}_{s_k}(x^*, x_m)$ becomes more similar to $\nabla f(x)$
in the area of curvature in that partition.


• the partition is split in an area near the curvature due to the hinge point.


Apart from these improvements, the computational effort to select a suitable
splitting criterion increases linearly with $M \cdot A$, whereby the curse of
dimensionality is weakened.
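The quality measure can be illustrated with a small Python sketch. It is a simplified sketch under stated assumptions, not the procedure of the paper: the ordered samples are divided once at a fixed index `n_left` (the iterative rebalancing of the groups is omitted), each group is fitted by a closed-form simple linear regression, both the min and max variants of the hinge are tried, and the ratio of Eq. (7) is computed with the Euclidean norm. The function names and the toy residuals are hypothetical.

```python
import math


def fit_line(u, v):
    """Ordinary least squares for v = b0 + b1 * u (closed form)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    b1 = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / suu
    return mv - b1 * mu, b1


def norm(e):
    """Euclidean norm of a residual vector (assumed reading of |e|)."""
    return math.sqrt(sum(r * r for r in e))


def hinge_quality(y_hat, e_lsr, n_left):
    """Return (gamma, hinge_point) for a hinge fitted to the points (y_hat, e_lsr)."""
    pairs = sorted(zip(y_hat, e_lsr))              # order the samples along y_hat
    u = [p[0] for p in pairs]
    e = [p[1] for p in pairs]
    a0, a1 = fit_line(u[:n_left], e[:n_left])      # local model of the left group
    b0, b1 = fit_line(u[n_left:], e[n_left:])      # local model of the right group
    best = None
    for combine in (min, max):                     # both variants of the hinge function
        e_h = [r - combine(a0 + a1 * t, b0 + b1 * t) for t, r in zip(u, e)]
        if best is None or norm(e_h) < best:
            best = norm(e_h)
    gamma = best / norm(e)                         # ratio of residual norms
    hinge_point = (b0 - a0) / (a1 - b1) if a1 != b1 else None
    return gamma, hinge_point


# V-shaped residuals: a linear model misses a kink between y_hat = 3 and 4.
y_hat = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
e_lsr = [3.0, 2.0, 1.0, 0.0, 0.0, 1.0, 2.0, 3.0]
gamma, hinge_point = hinge_quality(y_hat, e_lsr, n_left=4)
```

For this toy case the hinge reproduces the residuals exactly, so the ratio drops to zero and the hinge point lies midway between the two groups; real residuals would give a value between 0 and 1 when a split is useful.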


If a new locally optimal quality value is measured ($\gamma < \gamma^*$), in step c) of Figure 3
a new suitable splitting criterion is created. After the greedy search has been applied
($m = M$) and no suitable candidate has been identified ($\gamma > \gamma^*$), the whole FSM is
stopped. Otherwise, $x^*$ is extended by the $x_m$ that minimizes $\gamma$ and, if the complexity
limit is not reached ($\dim(x^*) < A + 1$), the FSM is continued. The FSM is
successfully completed if at least one suitable candidate has been identified. In
this case $\beta^*$ and $c^*$ construct either a uni- or multivariate splitting criterion. If
no input variables are selected by the FSM or if a minimal number of samples is
reached, a local model for prediction is determined, which is described next.
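The control flow of the forward selection can be sketched generically in Python. This is a sketch under assumptions rather than the authors' implementation: the quality evaluation is passed in as a hypothetical callable `quality` that returns the ratio $\gamma$ for a list of variable indices, and $\gamma^*$ is initialized to 1 so that only candidates that reduce the non-linearity are accepted (the source does not state the initial value).

```python
def forward_selection(candidates, quality, max_vars):
    """Greedy forward selection; returns (selected, best_gamma).

    quality(selected) returns the ratio gamma for a list of variable indices;
    smaller is better and gamma < 1 indicates an improvement.
    """
    selected = []
    best_gamma = 1.0                        # assumed initial value of gamma*
    while len(selected) < max_vars:         # complexity limit
        best_candidate = None
        for m in candidates:                # greedy search over unselected inputs
            if m in selected:
                continue
            gamma = quality(selected + [m])
            if gamma < best_gamma:          # new locally optimal quality value
                best_gamma, best_candidate = gamma, m
        if best_candidate is None:          # no suitable candidate: stop the FSM
            break
        selected.append(best_candidate)
    return selected, best_gamma


# Hypothetical quality values for three candidate inputs.
toy = {(0,): 0.9, (1,): 0.6, (2,): 0.95, (1, 0): 0.4, (1, 2): 0.7}
selected, gamma_star = forward_selection(
    candidates=[0, 1, 2],
    quality=lambda sel: toy.get(tuple(sel), 1.0),
    max_vars=2,
)
```

Each forward iteration evaluates every unselected candidate once, so the number of quality evaluations is bounded by the number of inputs times the complexity limit.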


3.3 Local Models


Similar to the requirements on the multivariate splitting criterion, the local
models must be accurate, as interpretable as possible and well generalized.
In order to achieve this, stepwise regression with another FSM is performed.
By means of the bias-corrected Akaike's Information Criterion


$$AICc = N \log\frac{RSS}{N} + 2M + N + N \log(2\pi) + \frac{2(M+2)(M+3)}{N - (M+2) - 1} , \qquad (8)$$


which is embedded into the FSM, both the model accuracy and complexity
are taken into account during the forward selection [15]. The first term of (8)
considers the model accuracy using the residual sum of squares


$$RSS = \sum_{n=1}^{N} e_n^2 \qquad (9)$$


and the remaining terms consider model complexity using the number
of selected input variables M and the number of samples N. AICc differs from
the uncorrected criterion through the additional bias-correction term resulting
from the last fraction, which leads to an improvement in accuracy for small
data sets or high-dimensional input spaces [15]. In this paper, the method is
limited to a first-order LSR model


$$\hat{y}_{\tilde{t}_k}(x) = \beta_0 + \sum_{m \in \mathbb{N} \mid x_m \in \mathcal{X}} \beta_m x_m , \qquad (10)$$


which is constructed from the selected input variables $\mathcal{X}$ with $\dim(\mathcal{X}) = M$. This
is illustrated next using a practical example.
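The criterion can be evaluated with a few lines of Python. The sketch assumes that Eq. (8) reads AICc = N log(RSS/N) + 2M + N + N log(2π) + 2(M+2)(M+3)/(N − (M+2) − 1); the function name `aicc` and the numbers in the usage lines are hypothetical.

```python
import math


def aicc(rss, n, m):
    """Bias-corrected AIC for m selected inputs, n samples and residual sum of squares rss."""
    k = m + 2                                   # parameter count used in the correction term
    if n - k - 1 <= 0:
        raise ValueError("too few samples for the bias-correction term")
    aic = n * math.log(rss / n) + 2 * m + n + n * math.log(2 * math.pi)
    return aic + 2 * k * (k + 1) / (n - k - 1)


# A candidate input is only worth adding if the drop in RSS outweighs the penalty.
score_small_model = aicc(rss=12.0, n=50, m=2)
score_large_model = aicc(rss=11.9, n=50, m=3)
```

In a forward selection, the candidate with the lowest value would be kept; here the marginal reduction of the residual sum of squares does not justify the additional input.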


3.4 Tree Structure 


The proposed algorithm constructs a binary tree, called LSRT, using both uni-
and multivariate splitting criteria. Figure 5c shows a Mixed Regression Tree
that approximates the function $f(x) = 5 x_1 x_2 + x_1^2 + x_2^2$, which is presented in
Figure 5a.


Figure 5: Mixed Regression Tree which is called LSRT and constructed by the proposed algorithm.
(a) Test function $f(x) = 5 x_1 x_2 + x_1^2 + x_2^2$ on which LSRT was trained using 50 samples
generated from $f(x)$. (b) Partitions resulting from LSRT and represented by $\tilde{t}_k$. The splits
(black dashed lines) are adapted to the function's gray contour lines. (c) LSRT consisting of
uni- and multivariate splitting criteria (to the right of $t_k$) and local linear models
$\hat{y}_{\tilde{t}_k}$ (below $\tilde{t}_k$).
Multivariate splitting criteria legible in panel (c): $0.40 x_1 + 0.92 x_2 < 0.01$;
$0.78 x_1 + 0.63 x_2 < -0.46$; $0.72 x_1 + 0.70 x_2 < 0.57$.


The tree was trained on 50 samples and, to cover the whole input space, these
samples are generated from an optimized Latin Hypercube Design [16]. The
tree consists of five decision nodes $t_k \; \forall \; k \in \{1, 2, 3, 5, 6\}$ with a uni- or
multivariate splitting criterion displayed to the right of the node and six terminal
leaves $\tilde{t}_k \; \forall \; k \in \{4, 8, 9, 10, 11, 7\}$ with a local model $\hat{y}_{\tilde{t}_k}(x)$.


Figure 5b presents the partitions of the input space resulting from the splitting
criteria, which are drawn as black dashed lines. For instance, the axis-oblique
split 1) results from the multivariate splitting criterion of the root $t_1$ and the
axis-orthogonal split 2) results from the univariate splitting criterion of node
$t_5$. Each partition contains a first-order LSR model $\hat{y}_{\tilde{t}_k}(x)$, which is only valid
in its partition. It can be seen that the directions of the splits are adapted
to the gray contour lines of $f(x)$, which indicates that the algorithm determines
suitable splits. To investigate the performance of the algorithm in more detail,
an extensive experimental analysis is performed in the following.


4 Experimental Analysis 


In order to analyze the proposed algorithm with regard to accuracy and model 
complexity, the algorithm is tested in Subsection 4.1 on synthetic data and in 
Subsection 4.2 on real-world data. Furthermore, to compare the performance 
to state-of-the-art construction algorithms for Regression Trees, LSRT is com- 
pared to GUIDE, which is determined by a toolbox [17]. 


To obtain comparable results for LSRT and GUIDE, the trees are constructed
based on similar hyperparameters. Both trees are pruned by the same
method using the same hyperparameters and splitting is stopped by a lower
bound of six samples. In addition, the local models are both determined using
an FSM. To ensure the interpretability of LSRT, multivariate splitting criteria
are limited to A = 2. Although both trees are constructed with similar
hyperparameters, the size of the pruned tree can vary between LSRT and GUIDE.
For a comparison unaffected by differing tree sizes, a third tree LSRT_adj is
considered that is pruned to the same size as GUIDE.


Test results are evaluated by the root mean squared error E and the tree size |T|,
measured by the number of nodes within the tree T. To generate meaningful 
results, E and |T| are averaged over 150 runs. 


4.1 Synthetic Data 


To generate synthetic data, a common test function from [18] is extended to


$$f(x) = \sum_{i=1}^{I} \left[ \frac{10}{\ast} \sin(\pi x_{5i-4} x_{5i-3}) + \frac{20}{\ast} (x_{5i-2} - 0.5)^2 + \frac{10}{\ast} x_{5i-1} + \frac{5}{\ast} x_{5i} \right] , \qquad (11)$$


where the denominators marked by $\ast$ are not legible in the source, so that the
dimensionality of $x \in \mathbb{R}^M \mid M = I \cdot 5$ can be varied in discrete steps
$I \in \mathbb{N}$ of five. In this way, the influence of dimensionality can be analyzed.
The input space is limited to $0 < x_m < 1$ and the N samples for training are
generated from an optimized Latin Hypercube Design [16] to fill the whole
input space. To analyze the influence of noise, white Gaussian noise $\varepsilon$ with
mean $\bar{\varepsilon} = 0$ and variance $\sigma^2 = 0.5$ is added to $f(x)$. Furthermore, $M_n$ noisy
input variables without a dependence on $f(x)$ are constructed using $\varepsilon$ with
$\bar{\varepsilon} = 0.5$ and $\sigma^2 = 0.1$. In each of the 150 runs, 1500 samples are randomly
generated from (11) for testing. Table 1 shows the experimental results on
eight synthetic data sets.


For all data sets, LSRT and LSRT_adj are more accurate than GUIDE. Compared
to GUIDE, the error of LSRT is reduced by 18.8% on average and the error of LSRT_adj,
which has the same complexity as GUIDE, is reduced by 13.6%. Furthermore,
the difference in error between GUIDE and LSRT_adj
is slight for data set {50,5,0,0}, whereas the difference for {300,5,0,0} is
significant (20.2%). These differences result from the mixed tree structure
of LSRT. Due to the properties of f(x), axis-oblique splits occur in deeper
layers of LSRT. With an average tree size of |T| = 2.2 for data set {50,5,0,0},
LSRT_adj consists only of univariate splits. In contrast, LSRT_adj for {300,5,0,0},
with an average size of |T| = 15.6, consists of several axis-oblique splits, which
demonstrates the improvements resulting from the oblique splits.


Table 1 shows that the influence of noise is handled well by the proposed
algorithm. The results of LSRT for the noisy data sets {300,5,0,0.5} and
{300,5,5,0} are similar to the results of {300,5,0,0}. In addition, the error
of LSRT resulting from {300,5,5,0} is 32.3% lower than the error of GUIDE.
An influence of dimensionality cannot be evaluated clearly. Due to an increase
of the function values by the addition of further terms, E is equally increased.
Based on the error reduction (23.0%) between LSRT and GUIDE for
{300,15,0,0}, it can be expected that the curse of dimensionality is weakened.


Table 1: Experimental results on synthetic data for two different trees LSRT and GUIDE.
LSRT_adj is a complexity-adjusted version of LSRT with the same size as GUIDE. The
elements in the brackets (left column) indicate the properties of the eight data sets. The
best results for model complexity |T| and test error E are in bold print.


Settings             LSRT                  GUIDE                 LSRT_adj
{N, M, M_n, σ²}      |T|    E ± σ          |T|    E ± σ          E ± σ
{50, 5, 0, 0}        3.9    2.13 ± 0.19    2.2    2.35 ± 0.20    2.31 ± 0.30
{100, 5, 0, 0}       7.1    1.73 ± 0.27    4.1    2.07 ± 0.16    1.94 ± 0.20
{200, 5, 0, 0}       11.0   1.23 ± 0.14    11.6   1.52 ± 0.15    1.23 ± 0.17
{300, 5, 0, 0}       11.4   1.09 ± 0.26    15.6   1.19 ± 0.26    0.95 ± 0.14
{300, 10, 0, 0}      11.0   1.42 ± 0.17    8.2    1.92 ± 0.32    1.68 ± 0.32
{300, 15, 0, 0}      9.5    1.67 ± 0.19    4.1    2.17 ± 0.13    2.08 ± 0.21
{300, 5, 5, 0}       11.6   1.11 ± 0.18    13.1   1.64 ± 0.41    1.24 ± 0.36
{300, 5, 0, 0.5}     11.9   1.10 ± 0.23    13.8   1.30 ± 0.22    1.03 ± 0.13


In four out of eight data sets each, LSRT and GUIDE have the lowest complexity,
which means that both trees achieve comparable results in interpretability.
A comparison between LSRT and LSRT_adj shows that for {300,5,0,0} and
{300,5,0,0.5} the tree size was penalized too much by the pruning method. This
can be recognized by the lower test error of LSRT_adj. In the following, LSRT
is tested in a more challenging task using real-world data.
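The relative error reductions quoted in this subsection can be recomputed from the mean test errors of Table 1. The Python sketch below is a plain arithmetic check; the dictionary holds the E values as read from the table (an assumed transcription), and the names `table1`, `reduction` and `mean_reduction` are hypothetical.

```python
# Mean test errors E from Table 1, keyed by the settings (N, M, M_n, sigma^2).
table1 = {
    (50, 5, 0, 0.0): {"LSRT": 2.13, "GUIDE": 2.35, "LSRT_adj": 2.31},
    (100, 5, 0, 0.0): {"LSRT": 1.73, "GUIDE": 2.07, "LSRT_adj": 1.94},
    (200, 5, 0, 0.0): {"LSRT": 1.23, "GUIDE": 1.52, "LSRT_adj": 1.23},
    (300, 5, 0, 0.0): {"LSRT": 1.09, "GUIDE": 1.19, "LSRT_adj": 0.95},
    (300, 10, 0, 0.0): {"LSRT": 1.42, "GUIDE": 1.92, "LSRT_adj": 1.68},
    (300, 15, 0, 0.0): {"LSRT": 1.67, "GUIDE": 2.17, "LSRT_adj": 2.08},
    (300, 5, 5, 0.0): {"LSRT": 1.11, "GUIDE": 1.64, "LSRT_adj": 1.24},
    (300, 5, 0, 0.5): {"LSRT": 1.10, "GUIDE": 1.30, "LSRT_adj": 1.03},
}


def reduction(e_new, e_ref):
    """Relative error reduction of e_new versus e_ref in percent."""
    return 100.0 * (e_ref - e_new) / e_ref


def mean_reduction(model):
    """Average reduction of `model` versus GUIDE over all eight data sets."""
    values = [reduction(row[model], row["GUIDE"]) for row in table1.values()]
    return sum(values) / len(values)


lsrt_avg = round(mean_reduction("LSRT"), 1)
lsrt_adj_avg = round(mean_reduction("LSRT_adj"), 1)
```

The averages are taken over the per-data-set percentages, which is the reading that reproduces the figures stated in the text.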


4.2 Real-World Data 


In contrast to synthetic data, real-world data provides more challenging tasks
for data-driven models due to incomplete samples, outliers and skewed data. To
analyze and compare the proposed algorithm with regard to a more challenging
task, four different real-world data sets, Baseball, Tecator, CPU and Redwine,
with a dimension from 6 to 24 input variables and a size from 209 to 1599
samples are used [19, 20, 21, 22]. Because of LSRT's limitation to numerical
input variables, categorical input variables are excluded from the data sets. In
addition, the Tecator data set is reduced by 100 input variables containing the
absorbance spectrum. Because of the small sample size of CPU with N = 209
and Tecator with N = 240, the analysis is performed by a k-fold cross validation.
Within each run and each data set, k trees of each type are trained on k − 1
varying data subsets and predict the remaining data subset [9]. For this
purpose, CPU and Tecator are analyzed with k = 10, Baseball with k = 5 and
Redwine with k = 2. The results are shown in Table 2.


Table 2: Experimental results on real-world data for two different trees LSRT and GUIDE.
LSRT_adj is a complexity-adjusted version of LSRT with the same size as GUIDE. The
real-world data sets Baseball and CPU are scaled by $10^{-2}$ and $10^{-1}$.


             LSRT                     GUIDE                    LSRT_adj
Data sets    |T|    E ± σ             |T|    E ± σ             E ± σ
Baseball     3.1    2.273 ± 0.118     3.0    2.346 ± 0.088     2.255 ± 0.101
Tecator      3.6    0.903 ± 0.057     1.6    0.926 ± 0.065     0.858 ± 0.040
CPU          5.2    5.214 ± 0.800     4.3    5.129 ± 0.514     5.471 ± 1.157
Redwine      3.3    0.645 ± 0.007     1.8    0.653 ± 0.007     0.652 ± 0.007
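The k-fold partitioning can be sketched as follows. This Python sketch only shows the index bookkeeping under simple assumptions: the samples are taken in their given order (a random shuffle before splitting is common but omitted here), and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example: the CPU data set with N = 209 samples and k = 10 folds.
folds = list(k_fold_indices(209, 10))
```

Each of the k trees would be trained on the `train` indices of one fold and evaluated on the corresponding `test` indices.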


Compared to the results on synthetic data, the improvements by the proposed
algorithm are less significant. On average, the error of LSRT is 1.3% and the
error of LSRT_adj is 1.2% less than that of GUIDE. Furthermore, GUIDE is less
complex for each data set. Nevertheless, due to a maximum size of |T| = 5.2
there are no limitations in interpretability.


Improvements due to an axis-oblique structure are only apparent for the Baseball
data set. The root of LSRT is split by a multivariate criterion, which reduces
the error versus GUIDE by 3.9%. For Tecator, the error of LSRT_adj is 7.3%
less than the error of GUIDE. Because of the small tree size (|T| = 1.6), the
error reduction results only from the FSM used to determine the local model. This
can be explained by distinct linear dependencies, which can be fitted well with
a single multiple regression model. A multivariate split of the root, which is
performed by LSRT (|T| = 3.6), results in an increase of error. Due to Tecator's
dimension of 24, this could be caused either by a wrong split selection or by the
limitation A = 2 of two input variables for splitting. Compared to GUIDE,
LSRT performs slightly worse on the CPU data set. Five out of six input
variables of CPU are discrete, so the segmentation process of the split selection
(compare Subsection 3.2) does not work correctly anymore. The Redwine data
set contains much noise and functional dependencies are low. Therefore, an
increase in accuracy of LSRT may result from an increase in complexity.


5 Conclusion 


To solve the issues of Regression Trees with respect to model accuracy, their
structure can be extended to Mixed or Oblique Regression Trees using
axis-oblique splits. In order to retain the advantages of Regression Trees when
using axis-oblique splits, a trade-off between an increase in model accuracy
and a loss of interpretability must be found. In this paper, a novel construction
algorithm for Mixed and Oblique Regression Trees was presented. The
direction for splitting within a partition is determined by a first-order LSR model
$\hat{y}_{s_k}(x)$, which is limited to a maximum number of significant input variables
$x^*$ by a forward selection method. Depending on the number of selected
input variables, this direction can be either axis-orthogonal or axis-oblique.
The selection of $\hat{y}_{s_k}(x^*)$ is based on a quality criterion, which is determined
by an approximation of the candidate models' residuals using a hinge function. In
this way, a split direction adapted to the function's curvature within the partition
is obtained and the resulting one-dimensional search space for the selection
weakens the curse of dimensionality. By the limitation to significant input
variables, interpretability and generalization are maintained. The proposed
algorithm was tested in an extensive experimental analysis using synthetic and
real-world data and compared with a state-of-the-art algorithm for Regression
Trees. Especially for synthetic data, significant improvements in model
accuracy are achieved, resulting in a lower test error compared to the
state-of-the-art algorithm. The improvements for real-world data were less
significant due to effects like discrete input values and partially unsuitable data
sets. To obtain meaningful results on real-world data, further experiments are
necessary.


For further improvements on real-world data, statistical tests and an approach to
consider categorical input variables and discrete values can be included in the
split selection process. To improve the overall performance of the proposed
algorithm, the hinge function should be determined by a common algorithm
in combination with a bootstrapping process. Stepwise selection combined
with a complexity penalty for $\hat{y}_{s_k}(x^*)$ could also provide further improvements
in split selection. Furthermore, to increase model flexibility by curved splits,
$\hat{y}_{s_k}(x^*)$ could be extended to a higher order. Additionally, the tree structure can
be extended to a neuro-fuzzy structure and, to decrease the computational effort
of the proposed algorithm, an efficient technique for prepruning is necessary.


Abstract


In this contribution, a new approach for optimizing LH designs based on the 
estimation and evaluation of pdfs is presented. The proposed algorithm mi- 
nimizes the mean absolute error between the estimated pdf of the LH design, 
evaluated solely on its data points, and the uniform distribution. To validate the 
functionality of the new approach, it is compared to other state-of-the-art met- 
hods to create space-filling designs. The methods are compared using the KL 
divergence of the resulting datasets and the uniform distribution, as well as the 
resulting computation times for various dimensions and number of data points. 
Overall, the KL divergence performance of the new approach is outstanding, 
but expensive in terms of the computational demand. An additional benefit of 
the proposed approach is that it allows higher flexibility for DoE designs. For
example, it can be extended to approach any arbitrary point distribution, not
just uniform, and may be suitable for the integration of constraints.


1 Introduction 


The training data point distribution in the input space, also called the experi- 
mental design, is an important influencing factor regarding the quality of data- 
driven models. For this reason, an assessment of the quality of the design of 
experiments (DoE) is important before measurements take place. If no prior
knowledge about a system exists, uniformly distributed input data should be 
used [2]. Furthermore, besides being space-filling in the original input space, 
two other properties are advantageous and concern the projection of the data 
onto the individual input axes: a) one-dimensional uniform distribution on each 
axis (1Duni) and b) non-collapsing design which means that the projected data 
points stay distinct in their 1D projections. 


A well-known strategy used for the experimental design that places points 
on a grid while avoiding the curse of dimensionality is the Latin hypercube 
(LH) design. LH designs fulfill the 1Duni property, but do not inherently 
provide a uniform data distribution. Therefore, several loss functions were 
proposed, which try to rate uniformity of the data point distribution, such as 
maximin or $\phi_p$ [1]. These loss functions focus on the data point pair with the
smallest distance which makes them suitable for optimization purposes, e.g. 
for the optimization of LH designs as in [3]. These approaches drive all points 
away from each other during optimization, thereby creating a uniform point 
distribution. On the downside, these local approaches are structurally not able
to rate the overall distribution quality, although they can be utilized for
optimization. The optimization can be carried out by local search, e.g., based on
point exchanges [4, 3], or by
global optimization methods, e.g., particle swarm optimization [10], simulated 
annealing [11] or evolutionary algorithms [12]. 
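The two ingredients named above, an LH design and a distance-based loss, can be sketched in Python. This is a minimal sketch under assumptions, not one of the cited optimization methods: the design is a random (unoptimized) LH design with points placed at cell centres, and the loss is the plain maximin criterion, i.e. the smallest pairwise Euclidean distance. The function names are hypothetical.

```python
import math
import random


def latin_hypercube(n_points, n_dims, rng):
    """Random LH design: one permutation of the N levels per axis, scaled to (0, 1)."""
    columns = []
    for _ in range(n_dims):
        levels = list(range(n_points))
        rng.shuffle(levels)                  # each level is used exactly once per axis
        columns.append(levels)
    return [
        tuple((columns[d][i] + 0.5) / n_points for d in range(n_dims))
        for i in range(n_points)
    ]


def min_pairwise_distance(design):
    """Maximin criterion: the smallest distance between any two design points."""
    return min(
        math.dist(p, q)
        for i, p in enumerate(design)
        for q in design[i + 1:]
    )


rng = random.Random(0)
design = latin_hypercube(n_points=8, n_dims=2, rng=rng)

# A diagonal design is also an LH design but clearly not space-filling.
diagonal = [((j + 0.5) / 8, (j + 0.5) / 8) for j in range(8)]
```

Both designs are uniform and non-collapsing in every one-dimensional projection, yet the diagonal one leaves most of the input space empty, which is why an additional criterion is needed to rate and optimize LH designs.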


This contribution proposes a new approach to optimize the data point distri- 
bution of an LH design based on the use of probability density function (pdf) 
estimation. Thus, contrary to the above-mentioned approaches, the proposed 
method can rate the overall distribution quality. The approach estimates the pdf 
for all data points. In this contribution a local search method that utilizes point 
exchanges is employed. In each iteration the algorithm exchanges a coordinate 
of a point pair, based on the difference of the estimated pdf and the uniform 
distribution. This procedure is compared to other state-of-the-art approaches 
for space-filling designs. The quality of the different approaches is evaluated 
using the Kullback-Leibler (KL) divergence. 


The contribution is structured as follows. Section 2 introduces the most
important methods to calculate space-filling designs, namely (i) Sobol
sequences and (ii) LH designs. Additionally, since LH designs are not
necessarily space-filling, optimization strategies to achieve space-filling LH
designs are introduced. Section 3 provides the concept of kernel density
estimation and a special
evaluation strategy for pdfs, before in Sec. 4 the new density-based LH design 
optimization strategy is presented. In Sec. 5 the performance of the algorithm 
is analyzed and compared to other state-of-the-art methods for space-filling 
designs. Finally, a conclusion is given. 


2 Space-filling designs 


The two main properties of a design of experiments (DoE), if no prior kno- 
wledge about the process is available, are the (i) 1Duni property and (ii) the 
space-filling property. This section gives a quick overview of two important 
methods to achieve space-filling designs. 


2.1 Sobol sequences 


The most commonly used strategy to create space-filling datasets was introduced
in [9]. The so-called Sobol sequences are low-discrepancy sequences, which
are uniformly distributed for $N = 2^x$ data points with $x \in \mathbb{N}$.
Sobol sequences are constructed by successively subdividing each dimension
into halves and reordering the coordinates in each dimension. This procedure
is computationally very cheap, even for high numbers of data points and high
dimensions, and is thus commonly used. On the downside, Sobol sequences do
not fulfill the 1Duni property.


2.2 Latin hypercubes 


Originally used in the field of computer experiments, LH designs fulfill
the 1Duni property due to their structure, but they do not intrinsically
fulfill the space-filling property. Therefore, a subsequent optimization of
the LH design has to be performed if a space-filling dataset is desired.


Figure 1: The first row of subplots visualizes the point distribution, the second row the 1Duni
property and the last row the space-filling property for N = 32 data points and n = 2.


To create an LH design, the number of samples N and the dimension n have
to be fixed by the user. Each input $u_i$, $i = 1, 2, \ldots, n$, is
partitioned into N levels. Thus, for N samples and n input dimensions, $N^n$
grid points are constructed. Out of these $N^n$ grid points, N are occupied by
data points. Here, each of the N levels is occupied exactly once in every
dimension, establishing the non-collapsing property of the LH design.
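The construction above can be illustrated with a minimal Python sketch. The function name `random_lh_design`, the scaling to the unit hypercube and the placement of points at the cell centers are assumptions made for illustration only; they are not prescribed by the text.

```python
import random

def random_lh_design(N, n, seed=0):
    """Return N points in [0, 1]^n that form a random Latin hypercube.

    Every axis is split into N levels and each level is used exactly once
    per axis. Placing the points at cell centers is an assumption.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n):
        levels = list(range(N))
        rng.shuffle(levels)  # random permutation of the N levels
        columns.append([(level + 0.5) / N for level in levels])
    # Transpose so that each row is one data point.
    return [[columns[d][i] for d in range(n)] for i in range(N)]

design = random_lh_design(N=8, n=2)
```

Because each axis receives a permutation of the same N levels, the projections onto the axes are uniform and distinct by construction, while the joint distribution may still be far from space-filling.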


2.3 Optimization of Latin hypercubes using deterministic 
local search 


This subsection focuses on the optimization of LH designs using the
deterministic local search (DLS) [4] and the extended deterministic local
search (EDLS) [3] algorithms.


The procedure of the DLS algorithm is as follows. A dataset, here an LH
design, with N data points in n dimensions is given. The goal is to maximize
the distance between the two nearest neighbors in the dataset. To achieve this,
the first step is to calculate the nearest neighbor distances between all data
points. The two data points with minimal distance are labeled as the critical
points; the rest of the dataset is labeled as potential swap partners. Now the
coordinates of a point pair are swapped in one dimension. This point pair
consists of one of the two critical points and one swap partner. If the
coordinate swap increases the minimum nearest neighbor distance, the procedure
starts again with the first step. If not, the algorithm tries all possible
dimensions and all potential swap partners. If no further improvement can be
achieved, the algorithm terminates. This procedure effectively drives all data
points away from each other, thereby creating a space-filling design. At the
same time, the 1Duni property is preserved.
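The DLS procedure described above might be sketched as follows. This is a simplified illustration rather than the implementation from [4]: the helper names `min_pair`, `improve_once` and `dls`, as well as the order in which swaps are tried, are assumptions.

```python
import math
import random

def min_pair(points):
    """Return (distance, i, j) of the closest pair of points."""
    best = (math.inf, -1, -1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d < best[0]:
                best = (d, i, j)
    return best

def improve_once(points):
    """Try coordinate swaps; keep the first one that increases the minimum distance."""
    d_min, a, b = min_pair(points)
    for crit in (a, b):                        # the two critical points
        for partner in range(len(points)):     # potential swap partners
            if partner in (a, b):
                continue
            for dim in range(len(points[0])):
                p, q = points[crit], points[partner]
                p[dim], q[dim] = q[dim], p[dim]    # try the swap
                if min_pair(points)[0] > d_min:
                    return True                    # keep it and restart
                p[dim], q[dim] = q[dim], p[dim]    # undo the swap
    return False

def dls(points):
    """Repeat improving swaps until none is left; the input is not modified."""
    points = [list(p) for p in points]
    while improve_once(points):
        pass
    return points

# Usage: optimize a random 2D Latin hypercube with N = 10 points.
rng = random.Random(0)
N = 10
cols = [rng.sample(range(N), N) for _ in range(2)]
start = [[(cols[d][i] + 0.5) / N for d in range(2)] for i in range(N)]
result = dls(start)
```

Since only coordinates within one dimension are exchanged, the set of values on each axis is unchanged, which is why the 1Duni property survives the optimization.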


The DLS algorithm has several positive properties. On the one hand, it
preserves the LH design if the initial set is also an LH design. On the other
hand, the algorithm can optimize any collapsing design as well. It is
deterministic, so the results are reproducible, and it ensures an improvement
in each iteration.


The idea of the DLS algorithm is developed further in [3] into the extended
deterministic local search (EDLS). A small modification of the DLS leads to a
vast change in the algorithm's behavior. Instead of considering only the
smallest nearest neighbor distance, all neighbor distances are inspected and
maximized. This leads to even more homogeneous data distributions than with
the DLS. On the downside, the computing time increases considerably.


3 Estimation and evaluation of probability density 
functions 


For an n-dimensional dataset consisting of the data points
$u(i) = [u_1(i)\; u_2(i)\; \cdots\; u_n(i)]^T$, $i = 1, 2, \ldots, N$,
summarized in a matrix $U = [u(1), u(2), \ldots, u(N)]$, a density estimator
can be used to estimate its pdf $\hat{p}(u)$. The pdf is based on placing a
kernel on each data point. This procedure is known as the kernel density or
Parzen estimator. In areas where two or more kernels overlap, the
corresponding values are added. In most cases, a Gaussian kernel is used. For
other kernel types, see [6, 8].


The pdf is estimated by

$$\hat{p}(u) = \frac{1}{N} \sum_{i=1}^{N} K(u, u(i)), \qquad K(u, u(i)) = \frac{\exp\left(-\frac{1}{2} [u - u(i)]^T \Sigma^{-1} [u - u(i)]\right)}{\sqrt{(2\pi)^n |\Sigma|}} \qquad (1)$$

typically with a diagonal covariance matrix
$\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$ containing
one standard deviation per dimension. Estimator (1) can be interpreted (i) as a
sum of N kernels with height one normalized by their integral sum or (ii) as an
average of N normal distributions.


One commonly used strategy to determine the standard deviations is
Silverman's rule-of-thumb [8]

$$\sigma_i = \sigma_{u,i} \left( \frac{4}{(n+2)\,N} \right)^{\frac{1}{n+4}} \qquad (2)$$

with the standard deviation of the data $\sigma_{u,i}$ in dimension i. For
alternative approaches to determine the standard deviations, see [7].
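As an illustration of the estimator and the rule-of-thumb, the following Python sketch evaluates a Gaussian kernel density estimate with a diagonal covariance matrix. The function names `silverman_sigmas` and `kde` are hypothetical, the bandwidth factor follows the commonly quoted multivariate form of Silverman's rule, and the use of the sample standard deviation (division by N - 1) is an assumption.

```python
import math

def silverman_sigmas(U):
    """Rule-of-thumb bandwidth per dimension for a dataset U of N points."""
    N, n = len(U), len(U[0])
    factor = (4.0 / ((n + 2) * N)) ** (1.0 / (n + 4))
    sigmas = []
    for d in range(n):
        col = [u[d] for u in U]
        mean = sum(col) / N
        # Sample standard deviation of the data in dimension d (assumption).
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / (N - 1))
        sigmas.append(std * factor)
    return sigmas

def kde(u, U, sigmas):
    """Average of N Gaussian kernels with diagonal covariance, evaluated at u."""
    norm = math.prod(s * math.sqrt(2.0 * math.pi) for s in sigmas)
    total = 0.0
    for center in U:
        expo = sum(((a - c) / s) ** 2 for a, c, s in zip(u, center, sigmas))
        total += math.exp(-0.5 * expo) / norm
    return total / len(U)

# Usage: bandwidth and density for a tiny one-dimensional dataset.
U = [[0.2], [0.5], [0.9]]
sigmas = silverman_sigmas(U)
density_at_half = kde([0.5], U, sigmas)
```

Because every kernel is a normalized normal density, the estimate integrates to one over the whole real axis, which matches interpretation (ii) of the estimator.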


Typically, Monte Carlo sampling is used to evaluate pdf estimates of data- 
sets. This is a computationally demanding task since the number of sampling 
points has to be high “enough”. In a recent publication, a different, less time- 
consuming strategy to estimate and evaluate pdfs was proposed [5]. Here, a pdf 
is estimated using kernel density estimation and evaluated solely on the data 
points of the dataset itself, to select a subset of an original dataset. Compared to 
Monte Carlo sampling the computational effort decreases dramatically, without 
decreasing the subset selection performance. 


In terms of calculation, this strategy can be visualized using a symmetric N x N 
matrix. Figure 2 shows the combinations of each point of a dataset u used for 
evaluation (rows) with the kernels of the estimated density (columns). The 
dataset’s pdf value evaluated at one data point can be calculated by taking the 
mean of the associated row of this calculation matrix. 


Figure 2: Pdf visualization in form of a matrix X. 


This calculation matrix additionally offers an interesting opportunity for the
estimation of pdfs. If a matrix is built up for a dataset and one data point
is replaced by a new data point, the pdf can easily be adjusted instead of
recalculating the whole matrix from scratch. One simply has to calculate the
effect of a kernel, placed on the new data point, on all other data points,
and vice versa. The effect of the kernel on all data points equals the column
of the particular data point in the calculation matrix, whereas the effect of
all kernels on the data point equals the row. If the row and column of the
new data point are updated, the matrix represents the pdf of the new dataset.
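The calculation matrix and its update for a single replaced point could look as follows in Python. This is a sketch under assumptions: the names `kernel`, `build_matrix`, `replace_point` and `pdf_at_points` are hypothetical, and the bandwidths `sigmas` are treated as fixed while a point is replaced.

```python
import math

def kernel(u, v, sigmas):
    """Gaussian kernel centered at v with diagonal covariance, evaluated at u."""
    expo = sum(((a - b) / s) ** 2 for a, b, s in zip(u, v, sigmas))
    norm = math.prod(s * math.sqrt(2.0 * math.pi) for s in sigmas)
    return math.exp(-0.5 * expo) / norm

def build_matrix(U, sigmas):
    """X[r][c] is the kernel placed on point c, evaluated at point r."""
    return [[kernel(ur, uc, sigmas) for uc in U] for ur in U]

def replace_point(X, U, idx, new_point, sigmas):
    """Replace U[idx] in place and refresh only row idx and column idx of X."""
    U[idx] = list(new_point)
    for k in range(len(U)):
        value = kernel(U[k], U[idx], sigmas)
        X[k][idx] = value  # kernel on the new point, seen from point k
        X[idx][k] = value  # kernel on point k, seen from the new point

def pdf_at_points(X):
    """Row means of X give the estimated pdf at every data point."""
    return [sum(row) / len(row) for row in X]

# Usage: update one point and read off the new pdf values.
U = [[0.1, 0.2], [0.4, 0.9], [0.8, 0.5]]
sigmas = [0.2, 0.3]
X = build_matrix(U, sigmas)
replace_point(X, U, 1, [0.3, 0.3], sigmas)
p = pdf_at_points(X)
```

The update touches 2N - 1 entries instead of all N x N, which is where the saving over a full recalculation comes from.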


4 New algorithm 


The proposed algorithm combines elements from the DLS algorithm and the
efficient pdf calculation strategy proposed in [5]. While the DLS algorithm
maximizes the minimum nearest neighbor distance to find optimal LH designs,
the new method uses a more global loss function. It minimizes the mean
absolute error between the uniform distribution and the estimated pdf of a
given LH design. For the flow chart of the new algorithm, see Fig. 3. The
proposed algorithm proceeds as follows.


1. Initialize the algorithm with an arbitrary n-dimensional LH design
$U = [u(1), u(2), \ldots, u(N)]$ consisting of the data points
$u(i) = [u_1(i)\; u_2(i)\; \cdots\; u_n(i)]^T$, $i = 1, 2, \ldots, N$.


2. In order to estimate the pdf $\hat{p}(u)$ of the LH design, kernel density
estimation is used. The N x N calculation matrix X is set up for the LH design
U. The matrix X is symmetric, since data points and evaluation points
are the same. Take the mean of each row of X to obtain $\hat{p}(u)$.


3. Select the data point with the maximum pdf value as the point to swap.
This represents the point where data is most dense and therefore should
be moved away by swapping with a partner point.


4. Go through all non-selected points $i = 1, \ldots, N-1$ in each dimension
$ii = 1, \ldots, n$ and initialize a swap matrix $X_s = X$ for each
point-dimension combination and an associated matrix $U_s = U$ to store the
altered LH design. Swap the coordinates of the swap point and point i in
dimension ii. Instead of recalculating the whole pdf, this merely requires
recalculating two rows and two columns of $X_s$ and updating two points of
$U_s$. Calculate the pdf at all data points. Calculate the error
$E(i, ii) = \sum_{k=1}^{N} |1 - \hat{p}(u(k))|$. For the $(N-1) \times n$
swaps, this results in the $(N-1) \times n$ error matrix E.


5. Execute the swap with the minimal value of the error matrix E. Update
the corresponding matrices $X = X_s$ and $U = U_s$ as well as
$\hat{p}(u) = \hat{p}(u_s)$. Terminate the algorithm if no improvement is
possible, otherwise continue at step 3.
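The five steps above might be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: for brevity it recomputes the pdf from scratch for every candidate swap instead of updating two rows and columns of the calculation matrix, it assumes a design scaled to the unit hypercube (so that the uniform density equals 1), and the names `pdf_at_points`, `error` and `density_based_lh` as well as the fixed bandwidths are assumptions. Keeping the bandwidths fixed is consistent with a rule-of-thumb choice, because coordinate swaps leave the standard deviation in every dimension unchanged.

```python
import math
import random

def pdf_at_points(U, sigmas):
    """Estimated pdf at each data point (row means of the calculation matrix)."""
    norm = math.prod(s * math.sqrt(2.0 * math.pi) for s in sigmas)
    out = []
    for u in U:
        total = 0.0
        for v in U:
            expo = sum(((a - b) / s) ** 2 for a, b, s in zip(u, v, sigmas))
            total += math.exp(-0.5 * expo) / norm
        out.append(total / len(U))
    return out

def error(U, sigmas):
    """Sum of absolute deviations from the uniform density 1 on [0, 1]^n."""
    return sum(abs(1.0 - p) for p in pdf_at_points(U, sigmas))

def density_based_lh(U, sigmas, max_iter=1000):
    """Greedy coordinate swaps of the densest point; the input is not modified."""
    U = [list(u) for u in U]
    best = error(U, sigmas)
    for _ in range(max_iter):
        p = pdf_at_points(U, sigmas)
        s = max(range(len(U)), key=p.__getitem__)    # step 3: densest point
        candidate = None
        for i in range(len(U)):                      # step 4: all partners
            if i == s:
                continue
            for dim in range(len(U[0])):             # in every dimension
                U[s][dim], U[i][dim] = U[i][dim], U[s][dim]
                e = error(U, sigmas)
                U[s][dim], U[i][dim] = U[i][dim], U[s][dim]  # undo
                if e < best:
                    best, candidate = e, (i, dim)
        if candidate is None:                        # step 5: no improvement
            break
        i, dim = candidate
        U[s][dim], U[i][dim] = U[i][dim], U[s][dim]  # execute the best swap
    return U, best

# Usage: optimize a random 2D Latin hypercube with N = 10 points.
rng = random.Random(1)
N = 10
cols = [rng.sample(range(N), N) for _ in range(2)]
start = [[(cols[d][i] + 0.5) / N for d in range(2)] for i in range(N)]
sig = [0.2, 0.2]
optimized, final_error = density_based_lh(start, sig)
```

The loop stops as soon as no swap of the densest point lowers the error, mirroring the termination rule in step 5; the `max_iter` cap is only a safeguard added for this sketch.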


5 Performance analysis 


Figure 3: Procedure of the new algorithm.


In this section, the performance of the new density-based LH design
optimization approach is analyzed and compared to other state-of-the-art
methods. For this purpose, space-filling designs for different dimensions
(n = 2,...,5) and numbers of data points (N = [20,40,...,200]) are compared.
To assess the space-filling qualities of the methods, the Kullback-Leibler
(KL) divergence between the estimated pdf of the datasets and the uniform
distribution is calculated using Monte Carlo sampling with 10000 random
points. Lower KL divergence values indicate a higher similarity of two pdfs.
For a KL divergence value of zero, the two pdfs are identical. Furthermore,
the computation time is investigated.
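A Monte Carlo estimate of such a KL divergence could be sketched as below. The direction of the divergence (uniform distribution relative to the estimated pdf), the unit hypercube as domain and the names `kde` and `kl_uniform_vs_kde` are assumptions of this sketch; the text does not specify them.

```python
import math
import random

def kde(x, U, sigmas):
    """Gaussian kernel density estimate with diagonal covariance, evaluated at x."""
    norm = math.prod(s * math.sqrt(2.0 * math.pi) for s in sigmas)
    total = 0.0
    for center in U:
        expo = sum(((a - c) / s) ** 2 for a, c, s in zip(x, center, sigmas))
        total += math.exp(-0.5 * expo) / norm
    return total / len(U)

def kl_uniform_vs_kde(U, sigmas, samples=10000, seed=0):
    """Monte Carlo estimate of KL(uniform || p_hat) on the unit hypercube."""
    rng = random.Random(seed)
    n = len(U[0])
    total = 0.0
    for _ in range(samples):
        x = [rng.random() for _ in range(n)]
        # The uniform pdf is 1 on the hypercube, so log(1 / p_hat(x)) remains.
        total -= math.log(kde(x, U, sigmas))
    return total / samples

# Usage: an evenly spread design versus a tightly clustered one.
grid = [[(i + 0.5) / 4, (j + 0.5) / 4] for i in range(4) for j in range(4)]
clustered = [[0.1 + 0.01 * i, 0.1 + 0.01 * j] for i in range(4) for j in range(4)]
kl_grid = kl_uniform_vs_kde(grid, [0.15, 0.15], samples=2000)
kl_clustered = kl_uniform_vs_kde(clustered, [0.15, 0.15], samples=2000)
```

In this toy comparison the clustered design yields the larger value, in line with the statement that lower KL divergence indicates a distribution closer to uniform.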


5.1 KL divergence evaluation 


Figure 4 shows the result of the evaluation of the KL divergence over the 
number of data points for n = 2,...,5. For each investigated dimension, the 
proposed density-based LH design optimization consistently produces the best 
results. In most parts of the evaluation, the EDLS achieves the second-best
results, followed by the DLS algorithm. The Sobol sequences exhibit the worst
overall KL divergence scores and thus the least space-filling datasets.


The difference between the approaches can be attributed to the employed
optimization criteria. The (extended) deterministic local search optimizes a
nearest-neighbor-based criterion, thus only distances between pairs of data
points are considered. In contrast, the density-based LH design optimization
directly optimizes the pdf of the LH design. Because of this global
optimization criterion, the density-based LH design optimization is more
powerful and therefore able to achieve better KL divergence values. Sobol
sequences have the simplest structure of the four analyzed methods.
Furthermore, Sobol sequences work best if $N = 2^x$ with $x \in \mathbb{N}$.
Thus, in a separate case not shown in this contribution, the four methods were
analyzed for N = [16,32,64,128,256]. The performance differences were similar
to the presented case.


5.2 Calculation time evaluation 


Figure 4: Comparison of KL divergences over different dimensions and number of data points.


Figure 5 shows the calculation time as a function of the number of data points
for different dimensions. Since the calculation time of Sobol sequences is
negligibly small, it is left out of this particular evaluation. The
density-based LH design optimization shows the highest calculation times,
increasing with both the number of data points and the number of dimensions,
followed by the EDLS and, with a big difference due to the simpler structure,
the DLS. Furthermore, as a reference, several higher-dimensional cases have
been investigated. In these test cases, the calculation time for the new
algorithm was 32 minutes in 7D and 69 minutes in 10D, both with N = 100 data
points.


The computational effort of the density-based LH design optimization can be 
attributed to the time-consuming calculations of the pdfs for every possible 
point switch in each iteration. An alteration of the algorithm in steps 4 and 5, 
compare Fig. 3, could make it more efficient. In the proposed version, i runs 
over all N — 1 data points and executes the swap with the lowest overall error. 
Alternatively, rather than performing the best swap, the algorithm could run 
until the first swap improves the error. This alteration will be subject to further 
research. 


5.3 Extension to a stochastic algorithm 


The density-based LH design optimization always performs the best possible
point switch. Therefore, it represents a local search method and can get stuck
in local optima. A simple, straightforward stochastic extension is presented
to circumvent this shortcoming and possibly increase the algorithm's
performance even more. On the downside, the loss is then no longer strictly
decreasing over the iterations.


The only necessary change takes place in step 5, compare Fig. 3. Instead of 
executing the best swap resulting in the best possible quality, a swap is executed 
with a probability proportional to the size of a quality measure. Here, the 
inverse mean absolute error is used as the measure of quality. This stochastic 
element helps the algorithm to escape bad local optima, see Fig. 6. Here, the 
density-based LH approach stops after around 100 iterations, because no swap 
leads to further improvement of the quality. In comparison, the density-based 
LH with stochastic extension is terminated after around 400 iterations through 
an artificial criterion, which terminates the algorithm if no improvement is 
achieved for 20 iterations. 


Figure 5: Comparison of computation times over different dimensions and number of data points. 


Figure 6: Exemplary comparison of the loss function progress over the iterations for N = 120 and 


Figure 7: Comparison of KL divergences over the number of data points. 


Figure 8: Comparison of calculation times over different dimensions and number of data points. 


As Fig. 7 shows, the impact of the stochastic extension is visible in 2D. Since
the influence in dimensions 3 to 5 is negligible, the plots are left out. While
the improvement in performance is moderate to non-existent, the increase in
computational demand is significant, see Fig. 8. From the perspective of
the author, the almost non-existent advantage in terms of performance is not
worth the impact on the computational demand.
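The stochastic choice in step 5 described in this subsection might be sketched as follows, assuming that the candidate swaps and their mean absolute errors are available as a dictionary. The name `choose_swap` and the dictionary layout are hypothetical.

```python
import random

def choose_swap(errors, rng):
    """Pick a swap with probability proportional to the inverse of its error.

    `errors` maps a candidate swap, e.g. (point index, dimension), to its
    mean absolute error. All errors are assumed to be strictly positive.
    """
    swaps = list(errors)
    weights = [1.0 / errors[s] for s in swaps]
    return rng.choices(swaps, weights=weights, k=1)[0]

# Usage: the swap with the lowest error is drawn most often, but not always.
rng = random.Random(0)
errors = {(0, 0): 0.1, (1, 0): 0.4, (2, 1): 0.5}
counts = {s: 0 for s in errors}
for _ in range(5000):
    counts[choose_swap(errors, rng)] += 1
```

Because worse swaps keep a non-zero selection probability, the loss can increase between iterations, which is what allows the search to leave a local optimum.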


6 Conclusion 


In this contribution, a new approach to optimize LH designs, based on the
estimation and evaluation of pdfs, was presented. The algorithm minimizes the
mean absolute error or any other criterion between the estimated pdf of the LH
design, evaluated solely on its data points, and the uniform distribution. With
this structure, the algorithm is capable of swapping data points of an LH design
to achieve a uniform point distribution. Alternative optimization strategies in
combination with the newly proposed criterion are also possible.


In order to validate the new approach, it was compared to other
state-of-the-art methods to create space-filling designs. The methods were
compared using the KL divergence between the resulting datasets and the
uniform distribution, as well as the resulting calculation times. Overall,
the new approach achieved the best KL divergence values and outperformed the
other methods significantly. However, in terms of computation time, the new
approach is expensive.


A possible alteration of the algorithm is to reduce computation time by
executing not the best but the first point swap that achieves an improvement
of the loss function. Furthermore, the global loss function opens up different
interesting research topics. Ways to implement boundaries or constraints in
the input space are a promising subject for future research, as is the
possibility to optimize a design to any desired distribution, not just the
uniform one.


Abstract 


A multi-agent system (MAS) consists of a group of agents that solve a common 
task through cooperation. Many problems arising in this setting can be formu- 
lated as distributed constrained optimization. In recent work, we considered 
the unconstrained version of the problem. In particular, we developed a theory 
to understand distributed gradient-based optimization methods, wherein the 
local (state) information is communicated via a lossy wireless network. A key 
contribution of the theory is that the information delay may be unbounded; 
however, the theory does not consider constraints. In this work, we present preliminary 
experimental results aimed towards extending the aforementioned work to the 
constrained setting. First, the constrained optimization problem is transformed 
into an unconstrained one using the penalty-based method. Then, we employ 
the distributed gradient approach from our previous work to solve the uncon- 
strained optimization in a decentralized manner. The illustrative experiments 
are based on autonomous pattern formation tasks for robotic swarms. The 
(simulated) robots cooperate to form a specified pattern (line, circle), with 
the constraint that the distances between neighboring robots equal a given 
constant. 


1 Introduction 


A multi-agent system (MAS) is typically large-scale in nature, and a wireless 
communication network is used to connect the various agents involved, due to 
its convenience and cost. Examples of MAS include wireless sensor networks 
and smart grids, see [3]. Many problems that arise in these systems can be cast 
as constrained optimization problems that need to be solved in a distributed de- 
centralized manner [6]. For example, in smart grids, a group of controllers has 
a common objective to minimize the control errors in terms of AC frequency or 
to maintain voltage levels in the whole grid with time-variant loads or energy 
sources. The controllers cooperate to solve this problem under constraints on 
the system state. 


The literature on distributed algorithms to solve constrained optimization pro- 
blems is rich, see e.g. [1]. However, they typically assume that the delay asso- 
ciated with the transfer of information from one agent to the other is bounded. 
Failed transmissions and channel delay are two main factors that contribute 
to information delay. In this paper, we focus on information delay due to 
failed transmissions. We study the effect of unbounded information delay on 
distributed algorithms for constrained optimization. In the past, unbounded 
information (update) delays were studied within the setting of unconstrained 
optimization in [4, 5]. 


The global objective is formulated in terms of a differentiable function. The 
agents solve this objective, together, by searching appropriate local subspaces 
via gradient steps. The solution to the global problem is obtained by putting 
together the distributed solutions. For local gradient calculations, at every step, 
the agents require information from other agents. Furthermore, each agent has 
to optimize subject to some local constraints. To this end, we use the penalty 
method to transform the constrained problem into an unconstrained one. In 
other words, the distributed gradient update of each agent is augmented by a 
penalty term that encodes the violation of local constraints. It may be noted 
that the associated penalty hyper-parameter is increased over time. Since the 
communication channel is lossy, the information from the peers may be de- 
layed and the agent is therefore forced to carry out update steps using outdated 
information. In [4], mild requirements on the quality of the wireless network
are presented, which ensure that using outdated information does not hinder
convergence. In this paper, we conjecture and present preliminary numerical
results which suggest that similar conclusions can be drawn even in the
presence of constraints.


To illustrate the ideas, we consider pattern forming tasks for robot swarms as 
an application. To this end, the specified pattern (line, circle) is expressed as 
an objective function. The objective is constructed, such that the minimum of 
the objective is reached when the robots arrange in the pattern. The objective 
function is evaluated using all robot positions, and the distances to neighboring 
agents constitute the local constraint set for every agent. At every time step, 
each robot moves in accordance with the local gradient update. To calculate this 
gradient, it uses the last known position of the other robots in the swarm. Since 
robot positions are communicated using lossy channels, the last known position 
may be outdated. In our experiments, we assume that the robots are ordered 
and communicate their knowledge of the swarm, only with direct neighbors in 
the chain. 


2 Problem Definition 


Broadly speaking, we have m agents that aim to minimize a given global 
objective function while satisfying local constraints. In other words, the agents 
cooperate to find: 


$$z^* = \operatorname*{argmin}_{z} \; \mathbb{E}_{\xi}\left[ J(z, \xi) \right], \qquad (1)$$
$$\text{s.t.} \quad G_i = \{ g_{ik} \le 0 \mid 1 \le k \le k_i \}, \quad 1 \le i \le m,$$

where $z^* = (z_1^*, \ldots, z_m^*)$ such that $z_i^*$ is the component of the
minimum that is calculated by the $i$-th agent $a_i$, and
$G_i = \{ g_{ik} \mid 1 \le k \le k_i \}$ is the local constraint set of $a_i$
containing $k_i$ inequality constraints. The stochastic objective function,
$J : \mathbb{R}^n \times S \to \mathbb{R}$, is such that
$z = (z_1, \ldots, z_m) \in \mathbb{R}^n$ where $z_i$ is the local variable
associated with $a_i$, and $\xi$ is an $S$-valued random variable. In typical
applications, $S$ is some compact subset of $\mathbb{R}^k$, $k \ge 1$, or
$\mathbb{R}^k$ itself. Please note that we allow for general vector-valued
$z_i$'s.


The reader may note that the local constraint set $G_i$, of $a_i$, is not
visible to $a_j$ for $j \ne i$. In other words, the agents are only aware of
their local constraints, not those of others. Since we use the penalty-based
method to transform the constrained optimization problem into an unconstrained
one, we may associate each $G_i$ with the following penalty function:

$$P_i(z) = \sum_{k=1}^{k_i} \max\left(0, g_{ik}(z)\right)^2. \qquad (2)$$
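The penalty function can be illustrated by a short Python sketch. It assumes a quadratic penalty, i.e. that each violated constraint enters with its squared violation; the function name `penalty` and the two example constraints are hypothetical.

```python
def penalty(z, constraints):
    """Quadratic penalty: sum over all constraints of max(0, g(z)) squared.

    Each element of `constraints` is a function g with the convention that
    g(z) <= 0 means the constraint is satisfied.
    """
    return sum(max(0.0, g(z)) ** 2 for g in constraints)

# Usage with two hypothetical constraints: z[0] <= 1 and z[1] >= 0.
constraints = [lambda z: z[0] - 1.0, lambda z: -z[1]]
inside = penalty([0.5, 0.5], constraints)    # both satisfied, no penalty
outside = penalty([3.0, -2.0], constraints)  # both violated by 2
```

Satisfied constraints contribute nothing, so the penalty only shapes the objective outside the feasible set.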


2.1 Communication Model 


As stated earlier, the agents are connected using a wireless communication
network. We model this using a weighted directed graph $G = (V, E)$. In this
graph, each agent is represented as a node and a directed edge
$e_{ij} = (a_i, a_j)$ exists if $a_i$ can directly transmit messages to $a_j$,
possibly using a dedicated unidirectional channel. The edge-weights
($\in [0, 1]$) represent the probability of successful transmission along that
edge. We assume that the transmissions along different edges are independent,
i.e., there is zero interference. We allow for graph evolution, provided the
graph is connected at all times. In particular, at any point in time, there
exists a path connecting $a_i$ to $a_j$ such that the product of the
edge-weights (success probabilities) is strictly greater than zero,
$1 \le i, j \le m$. Hence, there is a chance that the message sent by $a_i$
reaches $a_j$.
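As a small illustration of this model, the sketch below computes the success probability of a multi-hop path as the product of its edge-weights and simulates single lossy transmissions. The function names `path_success_probability` and `transmit` and the example weights are assumptions.

```python
import random

def path_success_probability(weights):
    """Probability that a message survives every hop of a path.

    Transmissions on different edges are independent, so the edge-weights
    (success probabilities) multiply.
    """
    p = 1.0
    for w in weights:
        p *= w
    return p

def transmit(weight, rng):
    """Simulate one transmission over an edge with success probability `weight`."""
    return rng.random() < weight

# Usage: a three-hop path and 1000 trials on a single edge with weight 0.5.
rng = random.Random(0)
p_path = path_success_probability([0.9, 0.5, 0.8])
delivered = sum(transmit(0.5, rng) for _ in range(1000))
```

As long as every edge-weight on the path is strictly positive, the product is strictly positive as well, which is the property required of the communication graph.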


To find a solution, $z^* = (z_1^*, \ldots, z_m^*)$, to the above described
constrained optimization problem, $a_i$ searches for $z_i^*$ in its local
search space $\mathbb{R}^{n_i}$ using the following gradient formula:

$$\frac{\partial J(z, \xi)}{\partial z_i} + \beta \, \frac{\partial P_i(z)}{\partial z_i}, \qquad (3)$$

where $\frac{\partial}{\partial z_i}$ is the partial derivative with respect to
the variable $z_i$, $\beta$ is the penalty parameter, and
$\sum_{i=1}^{m} n_i = n$. In order to calculate
$\frac{\partial J}{\partial z_i}$, $a_i$ requires updates from $a_j$,
$j \ne i$. This information is exchanged using the underlying wireless
communication network, which causes delays. In this paper, we consider the
following sources of information delays:

• packet losses;


• routing through other agents in the system, due to
the lack of direct connection.


Note that we do not consider channel delays in this paper. However, we believe 
that our ideas may be readily extended to incorporate, possibly unbounded, 
channel delays. The delays directly affect the age of the information available 
to an agent. In this paper, we use the terms information delay and age of
information interchangeably.


The gradient calculation in (3) deals with information delays by using the
latest available updates from other agents in the system.


2.1.1 On unbounded information delays 


Let us suppose that there are no packet losses. The delay due to indirect routing 
grows linearly as a function of the distance between the nodes in the graph. We 
assume that the diameters (maximum distance between any pair of nodes) of 
the evolving graphs are bounded, independent of time. Hence the delay due to 
indirect routing is also bounded. If we now consider packet loss, then updates 
within any bounded time-frame cannot be guaranteed. Hence, packet loss is 
the major contributor to information delay. The probability that $a_i$
successfully communicates with $a_j$ within any d time-step interval is some
p > 0, where d is the above mentioned bound on the graph diameter. Note that p
may vary over time. Hence, the number of successive d-length intervals without
successful communication is geometrically distributed. In other words, there
is no absolute bound on the information delay.
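To make the last statement concrete, consider the simplified case in which p is constant over time and the outcomes of different d-length intervals are independent (both are assumptions made only for this worked example). Let K denote the number of successive d-length intervals without successful communication. Then

$$P(K = k) = (1 - p)^k \, p, \qquad P(K \ge k) = (1 - p)^k, \qquad k = 0, 1, 2, \ldots$$

$$\mathbb{E}[K] = \frac{1 - p}{p}, \qquad \mathbb{E}[K^2] = \frac{(1 - p)(2 - p)}{p^2}.$$

For p < 1, the probability $P(K \ge k)$ is strictly positive for every k, so the information delay exceeds $k \cdot d$ time-steps with positive probability and no absolute bound exists. At the same time, the first and second moments of K are finite for any p > 0.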


3 Algorithm 


We are now ready to present an algorithm to solve (1). It may be noted that it
is based on penalty-based gradient descent methods for the centralized version.
In the setting considered here, agent $a_i$ updates $z_i$ in an iterative
manner, through gradients calculated using the latest available $z_j$,
$j \ne i$. At any time t, $a_i$ maintains a local view, $\hat{z}_i^t$, of the
global variable $z^t$. Formally speaking, the local view of $a_i$ at time t is
given by

$$\hat{z}_i^t = \left( z_1^{\tau_{i1}(t)}, \ldots, z_m^{\tau_{im}(t)} \right)^T,$$

where $0 \le \tau_{ij}(t) \le t$, and $t - \tau_{ij}(t)$ is the age of the
information from $a_j$ available to $a_i$ at time t. The local $z_i$ is updated
as follows:

$$z_i^{t+1} = z_i^t - \eta(t) \left( \frac{\partial J(\hat{z}_i^t, \xi^t)}{\partial z_i} + \beta(t) \, \frac{\partial P_i(\hat{z}_i^t)}{\partial z_i} \right), \qquad (4)$$

where $\eta(t)$ is the learning rate and $\beta(t)$ is the time-varying penalty
parameter. The $\xi^t$ are statistically independent samples that have the same
distribution as $\xi$.


Algorithm 1: Distributed Optimization

1: Initialize $a_i$ with $z_i^0$
2: for all time-steps t do
3:     process received $\hat{z}_j^t$
4:     update $z_i^t$
5:     for all $a_j$ : $(a_i, a_j) \in E$ do
6:         send $\hat{z}_i^t$ to $a_j$


3.1 Information exchange 


At time t, $a_i$ sends $\hat{z}_i^t$ to its neighbors. Simultaneously, it
receives $\hat{z}_j^t$'s from a subset of its neighbors (some may be lost due
to packet drops). It uses the obtained information to update its local view
$\hat{z}_i^t$, such that it contains the latest variables associated with other
agents. To facilitate consistent updates, we assume that each variable is
associated with a time stamp. Hence, at time-step t, $a_i$ receives
$\hat{z}_j^t$ along with the vector of time stamps
$(\tau_{j1}(t), \ldots, \tau_{jm}(t))$ from a subset of its neighbors. Using
the obtained information, $a_i$ updates the entries of $\hat{z}_i^t$ by
comparing time stamps. In other words, $a_i$ checks to see if
$\tau_{jk}(t) > \tau_{ik}(t)$; if yes, it updates $\hat{z}_i^t(k)$ to
$\hat{z}_j^t(k)$. It also updates the corresponding entry in $\tau_i$.
Otherwise, old entries are retained. Note that $\hat{z}_i^t(k)$ is used to
represent the information that $a_i$ has of $a_k$ at time t. Subsequently,
$a_i$ executes update step (4). Finally, the agent sends its updated local view
along with the updated time stamps to all its neighbors. This allows agent
$a_j$ similarly to discard outdated updates. The reader must note that no
retransmissions are triggered in case of failed transmissions since we do not
assume that received packets are acknowledged.
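The time-stamp comparison and the subsequent update step (4) could be sketched for a single agent as follows. The function names `merge_view` and `gradient_step`, the representation of the local view as a list of positions, and the numerical values are assumptions; the gradients are passed in as plain lists rather than computed from a specific objective.

```python
def merge_view(view, stamps, recv_view, recv_stamps):
    """Adopt every received entry whose time stamp is newer than the local one.

    `view[k]` is the latest known variable of agent k, `stamps[k]` its time
    stamp. Both lists are updated in place; older entries are discarded.
    """
    for k in range(len(view)):
        if recv_stamps[k] > stamps[k]:
            view[k] = recv_view[k]
            stamps[k] = recv_stamps[k]

def gradient_step(z_i, grad_J, grad_P, eta, beta):
    """Componentwise update z_i - eta * (grad_J + beta * grad_P), cf. (4)."""
    return [z - eta * (gj + beta * gp) for z, gj, gp in zip(z_i, grad_J, grad_P)]

# Usage: only the entry of agent 1 is newer in the received message.
view = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
stamps = [5, 2, 3]
merge_view(view, stamps, [(9.0, 9.0), (8.0, 8.0), (7.0, 7.0)], [4, 6, 3])
z_new = gradient_step([1.0, 2.0], [0.5, -1.0], [0.0, 2.0], eta=0.1, beta=10.0)
```

Entries with equal or older time stamps are ignored, so a message that arrives late can never overwrite fresher information.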


The discussion in this section has been codified in Algorithm 1. 


3.2 On the convergence of Algorithm 1 


We believe that a proof of convergence of the algorithm will proceed along
similar lines as the analysis in [4]. There, we studied the unconstrained
version of Algorithm 1. If the random variables associated with the age of
information, at every time-step, have bounded second moments, then it is shown
in [4] that the associated errors are asymptotically in the order of the
learning rate. Since the learning rate diminishes to zero, the effect of
information delays vanishes asymptotically. It is also shown that the
distributed algorithm has the same asymptotic properties as a centralized one.


A sufficient condition on the wireless network to ensure the above mentioned 
bounded second moment requirement is stated as assumption (A6) in [4]. It is 
restated below for our setting: 


• for each pair of agents, there is a non-zero probability of
successful transmission of the routed packet,


• and the transmission probabilities of all edges are
statistically independent.


As discussed in Section 2.1, the communication graph is always connected.
Hence the above stated conditions are readily satisfied.


Now, we discuss the influence of the penalty parameters on convergence. In
our algorithm, we do not use a constant $\beta$, rather we take
$\beta(t) \uparrow \infty$. This is done to avoid the scenario wherein the
algorithm converges to $z^\infty$ such that:

$$\nabla J(z^\infty) + \beta \sum_{i=1}^{m} \nabla P_i(z^\infty) = 0 \quad \text{and} \quad \nabla J(z^\infty) = -\beta \sum_{i=1}^{m} \nabla P_i(z^\infty) \ne 0.$$

This phenomenon is explained in the literature on centralized penalty-based
methods. As stated earlier, we can use the arguments from [4] to conclude that
Algorithm 1 has the same long-term behavior as its centralized counterpart. In
other words, $\beta(t) \uparrow \infty$ is important to ensure that
$\nabla J(z^\infty) = \sum_{i=1}^{m} \nabla P_i(z^\infty) = 0$, as desired.


However, the main issue with a time-varying penalty parameter that goes to
infinity is the growth in the variance of the descent directions. Hence, a
diminishing learning rate is required to counteract this. More precisely, $\eta(t)$
and $\beta(t)$ are chosen such that


This condition is inspired by a similar assumption, A1, in [7]. Intuitively, it is
clear that a condition such as $\eta(t)\beta(t) \to 0$ is required. However, it is shown
in [7] that this is not always sufficient.


4 Experiments 


In this section, we present the results of two experimental studies, in which
mobile robots are simulated as points in the two-dimensional Euclidean space¹.
The two scenarios optimize different objective functions and have slightly
different penalties, while the communication model is the same for all
experiments. A simple packet reception probability model is used to calculate
the success probability of a transmission as a function of the distance between
robots. To ensure connectedness of the communication graph, the minimal
transmission probability is bounded from below as follows:


$$p_{\mathrm{rx}}(d) = \operatorname{clip}\left(\frac{h - d}{h - l},\; 0.001,\; 1\right)$$


¹ https://github.com/stheid/DDSCO

The parameters $h$ and $l$ determine the thresholds of maximum and minimum
transmission probability, respectively.
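
The bounded packet reception model above can be sketched in a few lines of Python. This is a minimal illustration, assuming a linear ramp between two distance thresholds that is clipped to the interval [0.001, 1]. The function names, the argument names, and the reading that the probability is 1 up to distance `l` and reaches the floor at distance `h` are assumptions made for illustration; they are not taken from the simulation code.

```python
import random


def packet_reception_probability(d, l, h, floor=0.001):
    """Clipped linear ramp: 1 for d <= l, `floor` for d >= h (assumed reading)."""
    if h <= l:
        raise ValueError("h must be larger than l")
    raw = (h - d) / (h - l)
    # Clipping from below keeps every link usable with non-zero probability.
    return min(1.0, max(floor, raw))


def transmission_succeeds(d, l, h, rng=random):
    """Draw one Bernoulli trial for a transmission over distance d."""
    return rng.random() < packet_reception_probability(d, l, h)


# Hypothetical thresholds l = 1 and h = 5.
for distance in (0.5, 3.0, 5.0, 50.0):
    print(distance, packet_reception_probability(distance, l=1.0, h=5.0))
```

The lower clip value is what keeps the success probability of every link strictly positive, which is the property required by the assumption restated in Section 3.2.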


The following scenarios are robot-swarm pattern formation tasks. Each agent
is represented by a point $z_i = (x_i, y_i)^T \in \mathbb{R}^2$. In the first
scenario, the agents' objective is to form a line on which they are evenly
spaced. In the second scenario, the agents should form a circle while keeping
a constant distance to each other. The inter-robot distance is modeled by
the constraints, while the created structure is modeled in the objective function.
Initially, the agents are placed uniformly at random in a square region.
The vector $z$ is the concatenation of the local position vectors $z_i$ of all agents
$a_i$. Additionally, we denote $x = (x_1, \ldots, x_m)^T$ and $y = (y_1, \ldots, y_m)^T$. In our
experiments, we simulate a swarm of $m = 10$ robots.


4.1 Scenario 1: Forming a Line 


The objective function for forming a line consists of two parts. The first 
component is the residual error of the ordinary least squares (OLS) regression 
over all points. The second part is the distance between the first and last point. 
Maximizing the distance along with minimizing the residual error encourages 
the robots to unravel if they form a folded line. 


The x-term of each position is augmented by a constant for fitting the bias
term and by a random value to make the objective more challenging:
$\phi(x_i) = (1, x_i, \epsilon_i)$ with $\epsilon_i \sim \mathcal{N}(\mu = 0, \sigma = 0.1)$. For ease of notation, we define
$\Phi = (\phi(x_1)^T, \ldots, \phi(x_m)^T)^T$ as the transformation of $x$. Formally, the objective is
defined as follows:


$$J(x, y) = \frac{e^T e}{m^2} - \frac{\|z_1 - z_m\|^2}{m}$$

$$e = y - \Phi b$$

$$b = (\Phi^T \Phi)^{-1} \Phi^T y$$


Here, $b$ contains the coefficients of the OLS regression line and $e$ is the
residual error of the regression estimate. Since the regression error is of the
order of the squared number of nodes, it has been normalized accordingly.
Similarly, the distance between the first and last agent of the chain is
normalized by the number of agents.
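
The line objective described above can be evaluated with a short NumPy routine. The following is a minimal sketch, assuming that the span term is the squared Euclidean distance between the first and last agent and that it enters with a negative sign, since the distance is to be maximized. The function name `line_objective` and the random test positions are illustrative assumptions.

```python
import numpy as np


def line_objective(x, y, eps):
    """Evaluate J(x, y) = e^T e / m^2 - ||z_1 - z_m||^2 / m for m agents."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    eps = np.asarray(eps, dtype=float)
    m = x.size
    # Design matrix with rows phi(x_i) = (1, x_i, eps_i).
    Phi = np.column_stack([np.ones(m), x, eps])
    # OLS coefficients b, computed by least squares instead of an explicit inverse.
    b, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    e = y - Phi @ b
    # Squared distance between the first and the last agent.
    span_sq = (x[0] - x[-1]) ** 2 + (y[0] - y[-1]) ** 2
    return float(e @ e) / m**2 - span_sq / m


rng = np.random.default_rng(0)
m = 10
x = rng.uniform(0.0, 1.0, m)
y = rng.uniform(0.0, 1.0, m)
eps = rng.normal(0.0, 0.1, m)  # sigma = 0.1 as in the text
print(line_objective(x, y, eps))
```

Using `np.linalg.lstsq` rather than inverting $\Phi^T \Phi$ directly is a numerical convenience; both yield the same coefficients whenever $\Phi$ has full column rank.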


Each agent has up to two local constraints. The first and last agent have 
only one neighbor, therefore, they will only have one constraint. All other 
agents have two equality constraints to maintain a constant distance to their 
neighbors. 


More precisely, agent $i$ has the following constraints:


$$g_{i1}(z) = \left| d_0 - \|z_i - z_{i-1}\|^2 \right| = 0 \quad \text{if } i > 1$$

$$g_{i2}(z) = \left| d_0 - \|z_i - z_{i+1}\|^2 \right| = 0 \quad \text{if } i < m$$


with $d_0$ the required inter-agent distance and $\|\cdot\|$ the Euclidean norm.
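
To make the local constraints concrete, the following sketch computes the constraint values of every agent for a given configuration. It is an illustration under stated assumptions: the function name `constraint_violations` is hypothetical, and the `closed` flag is an addition for illustration that treats the first and last agent as neighbours, as needed for a ring-shaped formation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points in the plane."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def constraint_violations(z, d0, closed=False):
    """Return, for each agent, the values |d0 - ||z_i - z_j||^2| over its neighbours."""
    m = len(z)
    result = []
    for i in range(m):
        values = []
        # Predecessor constraint; the first agent only has one if the chain is closed.
        if i > 0 or closed:
            values.append(abs(d0 - squared_distance(z[i], z[i - 1])))
        # Successor constraint; the last agent only has one if the chain is closed.
        if i < m - 1 or closed:
            values.append(abs(d0 - squared_distance(z[i], z[(i + 1) % m])))
        result.append(values)
    return result


chain = [(0, 0), (1, 0), (2, 0)]
print(constraint_violations(chain, d0=1))
```

For the open chain, the first and last agent report a single value and all other agents report two, which mirrors the description of the local constraints above.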


4.2 Scenario 2: Forming a Circle 


Several objective functions could be chosen to form a circle. A quite obvious 
one would be to make the agents maximize the area of the polygon they span. 
Correctly calculating the area of arbitrary polygons is a non-trivial task, as self- 
intersecting polygons pose additional challenges [2]. Fortunately, the equation 
for calculating the area of simple polygons can be used as a lower bound for 
the actually covered area. Therefore, it is sufficient to use this equation for 
arbitrary polygons in our case, since we aim for maximization: 


$$J(x, y) = \sum_{i=0}^{m-1} \left( x_i y_{i+1} - x_{i+1} y_i \right) \quad \text{with } x_0 = x_m \text{ and } y_0 = y_m$$


For simplicity, the constant prefix has been omitted, as it will not influence the 
position of a minimum. Since the algorithm is purely guided by the gradient, 
it might converge to local optima. 
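
The area term can be illustrated with the standard shoelace sum for polygons. The sketch below assumes that vertices are listed in counter-clockwise order, so that the sum is positive, and it omits the constant factor 1/2 of the shoelace formula. The function name `polygon_area_objective` is an assumption for illustration.

```python
def polygon_area_objective(x, y):
    """Shoelace sum over consecutive vertices (twice the signed polygon area)."""
    m = len(x)
    total = 0.0
    for i in range(m):
        j = (i + 1) % m  # wrap around so that the last vertex connects to the first
        total += x[i] * y[j] - x[j] * y[i]
    return total


# A unit square in counter-clockwise order.
print(polygon_area_objective([0, 1, 1, 0], [0, 0, 1, 1]))
# The same four points visited in a self-intersecting (bow-tie) order.
print(polygon_area_objective([0, 1, 1, 0], [0, 1, 0, 1]))
```

The bow-tie ordering yields a smaller value than the simple square on the same points, which illustrates why the formula for simple polygons still acts as a lower bound when it is applied to self-intersecting ones.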
The constraints are similar to the previous example; however, the first and last
agent are now considered neighbours:


$$g_{i1}(z) = \left| d_0 - \|z_i - z_{i-1}\|^2 \right| = 0$$

$$g_{i2}(z) = \left| d_0 - \|z_i - z_{i+1}\|^2 \right| = 0 \quad \text{with } z_0 = z_m$$


The hyperparameters for learning rate and penalty scale are chosen as follows:


t + 30000 


Bi) =14 z5 


The learning rate therefore decreases in inverse proportion to the time step,
while the penalty term grows to infinity with the square root of the time step.
The constants in the equations were tuned by hand to achieve fast convergence.
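
The interplay of the two schedules can be checked numerically. The following sketch assumes a learning rate of the form c / (t + 30000) and a penalty of the form 1 + sqrt(t) / k, matching the qualitative description above. The offset 30000 appears in the schedule above, whereas the numerator `C_ETA` and the scale `K_BETA` are placeholder values chosen for illustration, not the hand-tuned constants.

```python
import math

T_OFFSET = 30000  # offset of the learning-rate schedule
C_ETA = 1.0       # hypothetical numerator of the learning rate
K_BETA = 1.0      # hypothetical scale of the square-root penalty growth


def learning_rate(t):
    """Inverse-proportional decay in the time step t."""
    return C_ETA / (t + T_OFFSET)


def penalty(t):
    """Penalty parameter growing with the square root of t."""
    return 1.0 + math.sqrt(t) / K_BETA


def product(t):
    """Product eta(t) * beta(t), which should vanish asymptotically."""
    return learning_rate(t) * penalty(t)


for t in (10**5, 10**6, 10**7, 10**8):
    print(t, penalty(t), product(t))
```

Under these assumed forms, the product behaves like t^(-1/2) for large t, so it tends to zero although the penalty itself is unbounded. As noted in Section 3.2, this property alone is not always sufficient.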


4.3 Results 


In both of the above experimental scenarios, we observed convergence to at
least a local minimum that satisfies all local constraints. In the line
scenario, the robots eventually arrange themselves on a line; however, the line
is sometimes folded into itself, which minimizes the regression error but does
not maximize the distance. In the circle scenario, we similarly see the
formation of perfect circles or of shapes like spirally intersecting circles,
which represent a local maximum of the covered area.


The rate of convergence was observed to strongly depend on the quality of the
wireless network (transmission success probabilities). Figure 1 illustrates the
evolution of the algorithm when tasked to form the circle in the second scenario.
In the beginning, the agents are severely penalized due to large inter-agent
distances. This causes them to move towards each other and form a closely knit
cluster. In the next phase, the agents try to form a larger circle, due to
the design of the objective function. In the final phase, the increasing penalty
parameter forces the agents to move closer to each other in order to fulfill the
distance constraint.


Figure 1: Snapshots of different phases of scenario 2. Each dot represents the position of a robot,
and the communication links are visualized as blue lines. The optimal solution is indicated
by the black circle.


4.3.1 Impact of Communication Quality on Rate of Convergence 


The quality of the wireless network seems to affect the rate of convergence.
To illustrate this, we experimented with different success probabilities
(network qualities). The empirical results suggest, however, that a strong
correlation is not directly visible. This is because aged information about the
peers' positions allows constraints to be violated more freely in some cases.
In other words, bad communication may allow for a streamlined convergence to a
wrong minimum (e.g., one that does not satisfy the constraints). Loosely
speaking, old information facilitates a more localized optimization and
constraint satisfaction, while updated information brings back a clearer
picture of the global constrained optimization. Figure 2 shows the convergence
of two identical configurations of scenario 2, which differ only in the quality
of the communication channel. The optimal solution subject to the distance
constraints is reached when all agents are arranged on the black circle in an
equally spaced manner. The right panel illustrates the scenario with good
communication, and the left panel the one with bad communication. Under good
communication, the algorithm has almost formed a circle, and only some
constraint violations are left to be addressed. Under bad communication,
the algorithm has not yet formed a circle and is hence lagging.


Figure 2: Convergence progression in scenario 2 after about 2000 update steps. The left picture shows
convergence with less reliable communication channels and hence slower convergence.


4.3.2 Constraint Satisfaction 


Let us consider the first scenario, wherein the agents try to form a straight
line while maximizing the distance spanned by the collection. However, the
constraint requires that a given inter-agent distance be achieved. With small
penalty scaling factors, the constraint is very loose and can easily be violated.
Therefore, the penalty needs to be scaled up continuously to allow for arbitrarily
small constraint violations. The continuous increase of the penalty allows a smooth
transition from the unconstrained to the fully constrained scenario. Similarly,
in the second scenario, the agents aim to form a circle of maximal area and
therefore drift outwards; however, the inter-agent distances must again be
maintained. Again, the penalty term needs to grow to infinity in order to
dominate the overall penalized objective when the solution does not satisfy the
constraints. In particular, growing the penalty parameter prevents convergence
to a point that satisfies


$$\nabla J = -\beta \sum_{i=1}^{m} \nabla P_i \neq 0.$$


The experimental results are in agreement with the general argument of
convergence stated at the end of Section 3.


Videos showing the evolution of the convergence can be found in the repository 
https://github.com/stheid/DDSCO. 


5 Conclusion 


We considered the problem of distributed constrained optimization with
stochastic objective and inequality-type constraints. To solve this problem, we
presented a penalty-based distributed gradient algorithm. We presented
preliminary empirical results to support the conjecture that the results from
[4] naturally extend to the inclusion of constraints, in particular, that
convergence in the presence of local constraints is unaffected by stochastic
information delays with bounded second moments.


For visualization, we investigated two scenarios of pattern formation in robot
swarms. The objective function was used to specify the pattern, subject to
the inter-robot distance constraints. The agents collectively minimized the
objective function by searching in appropriate subspaces. In the experiments,
we observed that the convergence speed correlates with the quality of the
communication channel, although a more thorough investigation is needed to paint
a clearer picture. We also look forward to analyzing the setting in a more
formal way in order to prove convergence theoretically.
1 Introduction


While tree-based Genetic Programming (GP) [1] is often used with crossover,
Cartesian Genetic Programming (CGP) [2] is mostly used with mutation
as the sole genetic operator. In contrast to the comprehensive and fundamental
knowledge about crossover in tree-based GP, the state of knowledge in CGP
still appears to be ambiguous. Two decades after CGP was officially introduced,
the role of recombination in CGP is still considered an open question, as a
recent survey of the state of knowledge about crossover in CGP confirms [3].
Even though some progress has been made in recent years, comprehensive and
detailed knowledge about crossover in CGP is still missing [3]. A promising
step forward was made by the introduction of the subgraph crossover [4], but
this technique has not yet been studied comprehensively. Therefore, this work
follows up on former work on the crossover question by investigating whether
CGP algorithms that utilize the subgraph crossover can search more efficiently
than the commonly used mutation-only CGP on a set of well-known benchmark
problems. This short paper provides an overview of the full paper version [5]
of the presented work.
1.1 Cartesian Genetic Programming


In contrast to tree-based GP, CGP represents a genetic program via a
genotype-phenotype mapping as an indexed, acyclic, and directed graph. The CGP
decoding procedure processes groups of genes, and each group refers to a function
node of the graph. The last genes of the genotype represent the outputs of the
phenotype. Each node is represented by two types of genes, which index the
function number in the GP function set and the node inputs. These nodes are
called function nodes and execute functions on the input values. The number
of input genes depends on the maximum arity $n_a$ of the function set. Given
the number of outputs $n_o$, the last $n_o$ genes in the genotype represent the
indices of the nodes that lead to the outputs. A backward search is used
to decode the corresponding phenotype. An example of the backward search
for the most popular one-row integer representation is shown in Figure 1. The
backward search starts from the program output and processes all nodes which
are linked in the genotype. In this way, only active nodes are processed during
evaluation. The genotype in Figure 1 is grouped by the function nodes. The
first (underlined) gene of each group refers to the function number in the
corresponding function set in the figure. The integer-based representation of
CGP phenotypes is mostly used with mutation only. The number of inputs $n_i$,
the number of outputs $n_o$, and the length of the genotype are fixed. Every
candidate program is represented by $n_r \cdot n_c \cdot (n_a + 1) + n_o$ integers,
where $n_r$ and $n_c$ denote the number of rows and columns of the graph. CGP is
traditionally used with a (1+λ) selection scheme of evolutionary algorithms.
The new population in each generation consists of the best individual of the
previous population and the λ created offspring. The breeding procedure is
mostly done by a point mutation that randomly changes genes in the genotype of
an individual within their valid range.
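
The decoding procedure can be illustrated with a compact Python sketch of the backward search on a one-row genotype. The function set, the fixed arity of two, the example genotype, and all names below are assumptions chosen for illustration; they do not reproduce the genotype shown in Figure 1.

```python
import operator

# Hypothetical function set; the index in this list is the function gene.
FUNCTIONS = [operator.add, operator.sub, operator.mul]
ARITY = 2
GENES_PER_NODE = ARITY + 1  # one function gene plus the input genes


def active_nodes(genotype, n_inputs, n_nodes):
    """Backward search from the output genes; returns sorted active node indices."""
    outputs = genotype[n_nodes * GENES_PER_NODE:]
    active = set()
    stack = [o for o in outputs if o >= n_inputs]
    while stack:
        node = stack.pop()
        if node in active:
            continue
        active.add(node)
        start = (node - n_inputs) * GENES_PER_NODE
        for source in genotype[start + 1:start + GENES_PER_NODE]:
            if source >= n_inputs:  # program inputs are not function nodes
                stack.append(source)
    return sorted(active)


def evaluate(genotype, inputs, n_nodes):
    """Evaluate only the active nodes, in ascending (feed-forward) order."""
    n_inputs = len(inputs)
    values = dict(enumerate(inputs))
    for node in active_nodes(genotype, n_inputs, n_nodes):
        start = (node - n_inputs) * GENES_PER_NODE
        function = FUNCTIONS[genotype[start]]
        a, b = genotype[start + 1], genotype[start + 2]
        values[node] = function(values[a], values[b])
    return [values[o] for o in genotype[n_nodes * GENES_PER_NODE:]]


# Two inputs (indices 0, 1), three function nodes (indices 2, 3, 4), one output.
genotype = [0, 0, 1,   1, 0, 1,   2, 2, 2,   4]
print(active_nodes(genotype, n_inputs=2, n_nodes=3))
print(evaluate(genotype, inputs=[3, 2], n_nodes=3))
```

In this example, node 3 is never reached by the backward search and is therefore inactive, so it is skipped during evaluation. The genotype length of 10 agrees with the size formula above for one row, three columns, arity two and one output.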


2 The Subgraph Crossover Technique 


The subgraph crossover technique for CGP is inspired by the subtree crossover
found in tree-based GP. To recombine two directed acyclic graphs, the
subgraph recombination is performed with respect to the CGP phenotype. The
phenotype of each individual is represented by the active path of the graph
and is determined through the evaluation process. Furthermore, the active
path of a graph determines the semantic value of an individual in CGP.
As a consequence, the subgraph crossover exclusively recombines the genetic
material of the active paths. The idea of the subgraph crossover is that it should
reduce the disruption caused by the genotypic single-point crossover
in standard CGP and truly recombine subgraphs.


Figure 1: Example of the decoding procedure of a CGP genotype to its corresponding phenotype.
The nodes are represented by two types of numbers which index the number in the
function lookup table (underlined) and the inputs (non-underlined) for the node. Inactive
function nodes are shown in gray color. The identifiers IP1 and IP2 stand for the two
input nodes with node index 0 and 1. The identifier OP stands for the output node of the
graph. Function lookup table shown in the figure: 0 = Addition, 1 = Subtraction, 2 = Division.

For the description of the subgraph crossover procedure, let $n_i$ be the predefined
number of input nodes and let $n_f$ be the predefined number of function nodes.
In CGP, the input nodes are indexed from $0$ to $n_i - 1$ and the function nodes of
each graph are indexed from $n_i$ to $n_i + n_f - 1$. The nodes which lie between the
input and output nodes are denoted as function nodes. The crossover is performed
with two parents, which are denoted as $P_1$ and $P_2$. For the crossover procedure,
the node numbers of the active function nodes are necessary. The node numbers
of the active nodes of $P_1$ and $P_2$ are stored in two arrays $M_1$ and $M_2$. The active
nodes are determined by the backward search in the evaluation procedure.

To define one suitable crossover point, we first define two possible crossover
points $C_{P_1}$ and $C_{P_2}$ of the two parents. With information about the active
nodes and the length of the path, the possible crossover points $C_{P_1}$ and
$C_{P_2}$ are chosen at random from the active function nodes stored in $M_1$ and
$M_2$. The possible crossover points may not be input or output nodes. A general
crossover point $C_P$ is defined by choosing the smaller of $C_{P_1}$ and $C_{P_2}$.
The reason for this is that the subgraphs of the parents, which will be placed
in front of or behind the crossover point in the offspring's genome, should be
balanced. The representation of CGP allows active paths of an individual that
start in the middle or at the back of the graph. The subgraph which will be
placed in front of the crossover point has to start at more leading active
nodes. If $C_P$ is defined as the possible point $C_{P_1}$, the subgraph of $P_1$ in
front of $C_P$ will be placed in front of $C_P$ in the offspring genome. The
subgraph of $P_2$ behind $C_P$ will be placed behind $C_P$ in the offspring genome.
The crossover procedure produces a new genome that represents the offspring and
involves the phenotypes of both parents. In the case that two children should
be produced, the crossover procedure is performed twice with two different
general crossover points. Since the representation of CGP provides connections
to any of the previous function nodes of the graph, performing only the
neighbourhood connect could result in a monotone data flow of the resulting
phenotype. An example of the crossover procedure is illustrated in Figure 2.
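
The selection of the crossover point and the splicing of the two genomes can be sketched as follows. This is a deliberately simplified illustration and not the complete operator: it covers only the choice of the general crossover point and the copying of genes, and it omits the subsequent reconnection of the two subgraphs. The function name, the fixed number of three genes per node, and the convention that the node at the crossover point belongs to the front part are assumptions.

```python
import random


def splice_at_crossover_point(g1, g2, m1, m2, n_inputs, genes_per_node=3, rng=random):
    """Pick one active function node per parent, cut at the smaller one, and splice.

    g1, g2: parent genotypes of equal length (one-row integer representation).
    m1, m2: node indices of the active function nodes of the two parents.
    Returns the offspring genotype and the general crossover point.
    """
    cp1 = rng.choice(m1)
    cp2 = rng.choice(m2)
    # The parent that supplied the smaller point contributes the front part.
    if cp1 <= cp2:
        cp, front, back = cp1, g1, g2
    else:
        cp, front, back = cp2, g2, g1
    # Index of the first gene behind node cp (assumption: cp is part of the front).
    cut = (cp - n_inputs + 1) * genes_per_node
    return front[:cut] + back[cut:], cp


parent_1 = [0, 0, 1,   1, 0, 1,   2, 2, 2,   4]
parent_2 = [1, 0, 1,   0, 2, 1,   2, 3, 3,   4]
child, point = splice_at_crossover_point(
    parent_1, parent_2, m1=[2, 4], m2=[2, 3, 4], n_inputs=2, rng=random.Random(1)
)
print(point, child)
```

Because every function node may only reference nodes with a smaller index, the spliced genome remains a valid feed-forward graph regardless of which parent supplies which part. Whether the active paths of both parents remain connected in the offspring depends on the reconnection step that is left out here.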


3 Experiments and Findings 


We performed experiments in the problem domains of symbolic regression and
Boolean function learning. To evaluate the search performance of the tested
algorithms, we measured the number of fitness evaluations until the CGP
algorithm terminated successfully (fitness-evaluations-to-success) and the best
fitness value found after a predefined number of generations
(best-fitness-of-run). We investigated a diverse set of popular GP benchmarks,
including single- and multiple-output problems. The problems are listed in
Table 2 and Table 3. We used a minimizing fitness function in all experiments.
For the symbolic regression problems, the fitness of an individual was
represented by a cost function value. The cost function was defined as the sum
of the absolute differences between the real function values and the values of
an evaluated individual. For the Boolean even-parity problems, the fitness was
represented by the number of fitness cases for which the candidate solution
failed to generate the correct value of the Even-Parity function. To evaluate
the fitness on the multiple-output problems, we defined the fitness value of an
individual as the number of bits that differ from the corresponding truth table.


Figure 2: Example of the subgraph crossover technique. The subgraph crossover basically works
similarly to the single-point crossover, except that the active nodes on both sides of the
crossover point are preserved. The crossover point is chosen in such a way that it is located
between active function nodes. At the top of the figure, the arrays with the active nodes
and crossover points are listed. Below this information, the genotypes and phenotypes of
the parents and the offspring are shown, and the parts of the crossover are marked with
dashed boxes. The legend distinguishes neighbourhood-connected edges from randomly connected edges.


Proc. 30. Workshop Computational Intelligence, Berlin, 26.-27.11.2020

In addition to the mean values of the measurements, we calculated the standard
deviation (SD) and the standard error of the mean (SEM). The algorithms
used in our study are listed in Table 1. The best parameter configuration
for each algorithm and problem was determined with the help of
meta-evolution. To classify the significance of our results, we used the
Mann-Whitney U test. The mean values are marked with † if the p-value is less
than the significance level 0.05 and with ‡ if the p-value is less than the
significance level 0.01 compared to the (1+4)-CGP. Note that the mean values
are only marked with a significance level marker if the result of a certain
algorithm is better than the result of the (1+4)-CGP. We performed 100
independent runs with different random seeds.
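
The descriptive statistics reported in the result tables can be computed from the per-run measurements as sketched below. The sketch assumes the sample standard deviation and the default quantile method of Python's `statistics` module; which estimators were actually used is not stated in the text, and the function name `summarize` and the sample values are illustrative.

```python
import math
import statistics


def summarize(samples):
    """Return mean, SD, SEM and quartiles for a list of per-run measurements."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (assumption)
    sem = sd / math.sqrt(n)         # standard error of the mean
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return {"mean": mean, "sd": sd, "sem": sem, "q1": q1, "median": median, "q3": q3}


# Hypothetical fitness-evaluations-to-success values of eight runs.
runs = [2, 4, 4, 4, 5, 5, 7, 9]
print(summarize(runs))
```

With 100 independent runs, the SEM is one tenth of the SD, which is the relation visible between the SD and SEM columns of most rows in the result tables.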

Table 4 and Table 5 show the results of the algorithm comparison in the Boolean
domain. As can be seen, the Canonical-CGP and the (μ+λ)-CGP perform
better than the mutation-only CGP algorithms on various problems. The results
of our experiments in the symbolic regression domain are shown in Table 6,
Table 7 and Table 8. The Canonical-CGP algorithm performs
better than the mutation-only CGP algorithms on all tested problems.

The experiments demonstrate that the subgraph crossover can contribute to the
search performance when used with a canonical GA or a (μ+λ)-strategy. Moreover,
the results of our experiments indicate that the predominance of the (1+4)-CGP
and (1+λ)-CGP algorithms cannot be generalized in the Boolean domain. The
experiments in the symbolic regression domain indicate that the use
of the subgraph crossover is beneficial for CGP and can contribute
significantly to the search performance in this problem domain. In particular,
the search performance of the Canonical-CGP algorithm was superior to that of
the (1+4)-CGP on all tested problems in the symbolic regression domain.
Table 1: List of the CGP algorithms


Identifier       Description
(1+4)-CGP        Traditional (1+4)-CGP algorithm
(1+λ)-CGP        Traditional (1+λ)-CGP algorithm
(μ+λ)-CGP        (μ+λ)-algorithm with subgraph crossover
Canonical-CGP    Canonical genetic algorithm (GA) with tournament selection and subgraph crossover

Table 2: Symbolic regression problems


Problem      Objective Function                          Vars
Koza-1       $x^4 + x^3 + x^2 + x$                       1
Koza-2       $x^5 - 2x^3 + x$                            1
Koza-3       $x^6 - 2x^4 + x^2$                          1
Nguyen-4     $x^6 + x^5 + x^4 + x^3 + x^2 + x$           1
Nguyen-5     $\sin(x^2)\cos(x) - 1$                      1
Nguyen-6     $\sin(x) + \sin(x + x^2)$                   1
Nguyen-7     $\ln(x + 1) + \ln(x^2 + 1)$                 1
Keijzer-6    $\sum_{i=1}^{x} 1/i$                        1
Pagie-1      $1/(1 + x^{-4}) + 1/(1 + y^{-4})$           2

Table 3: Boolean function problems


Problem             Number of Inputs    Number of Outputs
Parity-Even 3       3                   1
Parity-Even 4       4                   1
Parity-Even 5       5                   1
Parity-Even 6       6                   1
Parity-Even 7       7                   1
Adder 1-Bit         3                   2
Adder 2-Bit         3                   3
Subtractor 2-Bit    4                   3
Multiplier 2-Bit    4                   4

Table 4: Results for the Boolean single-output problems evaluated by the number of fitness evaluations (FE) to termination


Problem          Algorithm        Mean FE    SD        SEM       1Q        Median    3Q
Parity-Even-3    (1+4)-CGP        3177       3417      ±343      1246      2136      3760
                 (1+λ)-CGP        2495       2919      ±293      846       1534      2872
                 Canonical-CGP    3107       3070      ±307      1201      2104      3907
                 (μ+λ)-CGP        1565†      1517      ±152      602       1168      1892
Parity-Even-4    (1+4)-CGP        15420      14152     ±1422     6292      10358     17726
                 (1+λ)-CGP        16523      19168     ±1926     6095      11276     18557
                 Canonical-CGP    54967      47042     ±4727     24813     40612     71851
                 (μ+λ)-CGP        11135‡     8447      ±845      5117      8527      14085
Parity-Even-5    (1+4)-CGP        45542      33947     ±3411     21524     36834     61222
                 (1+λ)-CGP        34375‡     28146     ±2828     20685     27104     38941
                 Canonical-CGP    28413‡     25538     ±2566     23388     19640     34876
                 (μ+λ)-CGP        43476      2055      ±1022     23814     36188     57182
Parity-Even-6    (1+4)-CGP        199989     142915    ±14291    107418    163234    242573
                 (1+λ)-CGP        118768‡    73682     ±7368     65766     91577     156639
                 Canonical-CGP    242986     161762    ±16257    134518    200196    309346
                 (μ+λ)-CGP        110158‡    75163     ±7516     63908     90676     135148
Parity-Even-7    (1+4)-CGP        478055     301113    ±30111    268210    393362    605372
                 (1+λ)-CGP        441857     328539    ±32853    226272    352254    545197
                 Canonical-CGP    631568     548180    ±54818    293613    453204    750792
                 (μ+λ)-CGP        358420†    246131    ±24613    189278    303988    451667

Table 6: Results of the algorithm comparison for the problems Koza-1, Koza-2 and Koza-3 evaluated by the number of fitness evaluations (FE) to termination


Problem    Algorithm        Mean FE     SD          SEM         1Q        Median     3Q
Koza-1     (1+4)-CGP        8675635     16681422    ±1668142    441477    1814344    7045961
           (1+λ)-CGP        7370880†    17384354    ±1738435    204400    1050936    4294170
           Canonical-CGP    663822‡     838546      ±83854      135162    337950     710275
           (μ+λ)-CGP        7780751‡    15830735    ±1583073    197284    1830312    6318740
Koza-2     (1+4)-CGP        8264426     19894512    ±1989451    150140    888884     4378756
           (1+λ)-CGP        8191549     20275790    ±2027579    94290     559028     4710848
           Canonical-CGP    444118†     95000       ±286700     627550    29650      78800
           (μ+λ)-CGP        5729778     11021660    ±1102166    238156    1320880    5878696
Koza-3     (1+4)-CGP        600153      1214527     ±121452     39076     177418     443038
           (1+λ)-CGP        753551      2535215     ±253521     29528     120368     431318
           Canonical-CGP    32870†      57156       ±10435      2488      6700       32713
           (μ+λ)-CGP        926857      3473467     ±347347     28548     121040     362180

Table 8: Results of the algorithm comparison for the symbolic regression problems evaluated with the best-fitness-of-run method


Problem      Algorithm        Mean Best Fitness    SD        SEM      1Q       Median    3Q
Keijzer-6    (1+4)-CGP        3.78                 2.61      ±0.26    2.16     3.24      4.59
             (1+λ)-CGP        3.38                 2.52      ±0.25    2.41     3.03      3.158
             Canonical-CGP    2, sit               1.13      ±0.11    1.78     2.90      3.75
             (μ+λ)-CGP        2, 8st               1.09      ±0.1     2.25     3.14      3.15
Pagie-1      (1+4)-CGP        128.18               48.19     ±4.81    87.81    119.09    161.08
             (1+λ)-CGP        120.75               44.95     ±4.49    86.14    120.91    155.06
             Canonical-CGP    98.52†               50.57     ±5.08    59.04    85.31     130.04
             (μ+λ)-CGP        99.74†               41.246    ±4.12    65.32    95.79     131.76